Open Source Intelligence Gathering 101
======================================
_A penetration test almost always begins with an extensive information gathering phase. This post talks about how open sources of information on the Internet can be used to build a profile of the target. The gathered data can be used to identify servers, domains, version numbers, vulnerabilities, misconfigurations, exploitable endpoints and sensitive information leaks. Read on!_
There is a ton of data that can be discovered via open source intelligence gathering techniques, especially for companies that have a large online presence. There is always some tiny piece of code, a tech forum question with elaborate details, a long-forgotten subdomain, or a PDF of marketing material whose metadata can be used against a target site. Even simple Google searches often lead to interesting results. Here are _some_ of the things that we do once we have the client’s (domain) name (in no particular order):
1. Do a whois lookup to find the admin contact and other email addresses. These email addresses very often exist as valid users on the application as well. They can be checked against database leaks, or through a search service like [HaveIBeenPwned](https://haveibeenpwned.com) that tells you if an email address was found as part of a breach.
![](https://cdn-images-1.medium.com/max/1600/1*j75XCqSNPVzp9y8GAYxTyA.png)
Example of an email address found in a breach as searched on [https://haveibeenpwned.com](https://haveibeenpwned.com)
Apart from email addresses, whois queries can return IP history, domain expiry dates and even phone numbers that can be used in social engineering attacks.
![](https://cdn-images-1.medium.com/max/1600/1*1HByjz1nYKKi4s3EuAhV_g.png)
whois.domaintools.com is an excellent place to query whois records
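The basic lookup is a single command, and the registrant, admin and tech contact emails in the output are the ones worth collecting (when not redacted by a privacy service). A minimal sketch, with example.org standing in for the target:
```
# Query the registrar's whois record for the target domain
whois example.org

# Keep only the lines that mention email addresses
whois example.org | grep -i 'email'
```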
2. Run a Google advanced search using the _site_ operator to restrict results to the target domain, looking for php (or any server-side script filetype), txt or log files:
```
site:*.example.org ext:php | ext:txt | ext:log
```
On several occasions we have identified interesting files (log files, for example) that contain sensitive information and the full filesystem path of the application using search queries like these. You can couple this query with the minus operator to exclude specific results, as shown below.
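For instance, a query along these lines drops a noisy subdomain from the results (the excluded host is purely illustrative):
```
site:*.example.org ext:php | ext:txt | ext:log -site:blog.example.org
```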
![](https://cdn-images-1.medium.com/max/1600/1*pOtl_CjECSnkkGBJxC6YsA.png)
A Google search for files that leak information via the phpinfo() function
3. Search the domain (and subdomains) for good old-fashioned documents. File types include PDF, Excel, Word and PowerPoint to begin with. These documents may contain information that you can use for other attacks. Often, the document metadata (author name etc.) contained in the file properties can be used as a valid username on the application itself.
```
site:*.example.org ext:pdf | ext:doc | ext:docx | ext:ppt | ext:pptx | ext:xls | ext:xlsx | ext:csv
```
You can download these files locally and run them through a document metadata extractor, or view each file’s properties, to see what information is leaked.
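As a rough sketch of that workflow, assuming the files have been collected into a local directory, a metadata extractor like exiftool can pull author names and software versions out in one pass (the URL and filenames here are illustrative):
```
# Download a document found via the search above
wget https://www.example.org/files/annual-report.pdf

# Dump all metadata for one file; Author/Creator fields often hold usernames
exiftool annual-report.pdf

# Extract just the author-related tags from every PDF in the directory
exiftool -Author -Creator -S *.pdf
```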
To see all the options available for advanced searching, refer to [https://www.google.co.in/advanced_search](https://www.google.co.in/advanced_search). Also, the [Google Hacking Database](https://www.exploit-db.com/google-hacking-database/) (now on exploit-db) lets you use pre-crafted queries to search for specific and interesting things on the Internet.
![](https://cdn-images-1.medium.com/max/1600/1*QJiz9QCCagjYI0Ak9yg7NQ.png)
Google Hacking Database at exploit-db
4. Check the _robots.txt_ file for hidden, interesting directories. Most shopping carts, frameworks and content management systems have well-defined directory structures, so the admin directory is often just a _/admin_ or _/administration_ request away. If not, the _robots.txt_ will very likely contain the directory name you seek.
![](https://cdn-images-1.medium.com/max/1600/1*cfw6vicXCwQc1xh5PlZz_g.png)
robots.txt for a popular site
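Fetching the file needs nothing more than a plain HTTP request, so a quick check (with an illustrative host) might be:
```
# robots.txt lives at a well-known path on the web root
curl -s https://www.example.org/robots.txt

# The Disallow entries are usually the interesting ones
curl -s https://www.example.org/robots.txt | grep -i 'disallow'
```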
5. Look through the HTML source to identify carts, CMSs, frameworks etc. Identifying the application type helps focus the attack on areas of the application that have vulnerable components (plugins and themes, for example). If you look at the page source and see _wp-content_, you can be certain that you are looking at a WordPress site.
A lot of publicly available browser add-ons can also be used to identify website frameworks. Wappalyzer on Firefox does a pretty good job of identifying several different server types, server- and client-side frameworks, and third-party plugins on a site.
![](https://cdn-images-1.medium.com/max/1600/1*LLK4o3wfggc_zvHYhLDYSA.png)
wappalyzer in action on https://www.wordpress.org
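The same fingerprinting can be done from the command line as well; here is a small sketch that greps a page for a few telltale CMS strings (the markers are illustrative, not exhaustive):
```
# Pull down the home page and look for common CMS fingerprints
curl -s https://www.example.org/ | grep -iE 'wp-content|wp-includes|joomla|drupal'

# Response headers often leak server and framework versions too
curl -sI https://www.example.org/ | grep -iE '^(server|x-powered-by)'
```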
6. More often than not, if the site you are looking at was created by a third-party vendor, you will see a variant of "Powered by Third-Party-Developer-Company" somewhere at the bottom of the home page.
Following this trail of information gathering to the contractor’s site can be incredibly rewarding. Browsing through it may reveal the frameworks and version numbers they build upon. It is also very likely that the contractors have a test/admin account on your client’s site as part of their development process.
In my experience, many site administrators and developers use passwords that are a variation of a company name (the client’s or the contractor’s) plus some numbers, with or without special characters at the end. For example, if the contractor company was called "Example Developers", then 001Example, Example001, 00example, example00 and so on are good password candidates to try on the client website’s login panel. A small script can generate such a list, as sketched below.
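This is only a sketch of the pattern described above; the base word and suffixes are assumptions you would tailor per engagement:
```
# Generate company-name based password candidates (base word is illustrative)
base="example"
for suffix in 001 01 1 123 '2017!'; do
  echo "${base}${suffix}"
  echo "${base^}${suffix}"   # capitalise the first letter (bash 4+)
  echo "${suffix}${base}"
done > password-candidates.txt
```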
_(Watch out for our next post on how we used this technique to compromise and gain access to a client’s server and run shell commands on it.)_
7. Look through the company’s LinkedIn profile to identify senior managers, directors and non-technical staff. Very often, the weakest passwords belong to the non-technical management folk. Searching through the "About Us" page on the company website can also lead to soft targets.
From the discovery of even a couple of email addresses, a standard username format can be derived. Once the format is understood, a list of email addresses and equivalent usernames can be created and then used in other attacks, including brute forcing login pages or even [exploiting weak password reset functionality](https://blog.appsecco.com/mass-account-pwning-or-how-we-hacked-multiple-user-accounts-using-weak-reset-tokens-for-passwords-c2d6c0831377). (On more than one occasion, searching for email addresses and probable usernames has led to complete application and server compromises due to the use of weak passwords.)
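Once the format is known (say, first initial plus last name), turning a list of names into usernames and email addresses is a short loop. A sketch, assuming a hypothetical names.txt with one "First Last" entry per line:
```
# Derive jsmith-style usernames and matching email addresses from names.txt
while read -r first last; do
  u="$(printf '%s' "${first:0:1}${last}" | tr '[:upper:]' '[:lower:]')"
  echo "$u"
  echo "${u}@example.org"
done < names.txt > usernames.txt
```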
8. Perform IP address related checks. Applications are very often compromised via a different, weaker application hosted on the same IP (shared hosting). Using reverse IP lookups, you can identify additional targets to poke around. [Bing has an excellent search-by-IP feature.](https://help.bing.microsoft.com/#apex/18/en-US/10001/-1)
![](https://cdn-images-1.medium.com/max/1600/1*Dtm_iquz8ORZ5XF52dKIyQ.png)
Bing’s IP search can be used to find other websites hosted on the same server
The folks over at [you get signal](http://www.yougetsignal.com/tools/web-sites-on-web-server/) and [IP Address](http://www.ip-address.org/reverse-lookup/reverse-ip.php) provide a reverse lookup facility as well.
![](https://cdn-images-1.medium.com/max/1600/1*kBxqbCyMWL7cEObyyky5OA.png)
You can provide a domain name or an IP address at You Get Signal.
As part of the IP address checks, it is important to also note the A and PTR records of the domain. Sometimes, due to a misconfiguration, a different site may be accessible when using the PTR or the A record of the site. This information can be obtained with the nslookup or dig commands:
```
dig -x 8.8.8.8       # PTR (reverse) lookup for an IP address
nslookup 8.8.8.8
dig example.org A    # forward lookup for the domain's A record
```
9. Enumerate subdomains to find low-hanging fruit and weaker entry points into the client’s hosting infrastructure. Subdomain enumeration is easily one of the most important steps in discovering and assessing the assets a client has exposed online, either deliberately as part of their business or accidentally due to a misconfiguration.
Subdomain enumeration can be done with various tools like dnsrecon, subbrute and knock.py, using Google’s _site_ operator, or via sites like dnsdumpster and even virustotal.com. Most of these tools use a large dictionary of common descriptive words like _admin, pages, people, hr, downloads, blog, dev, staging_ etc. These words are prepended to the primary domain (example.org) to create a list of possible subdomain names like _admin.example.org_, _pages.example.org_, _people.example.org_ etc. Each of these names is then checked against a DNS server to verify whether the entry exists.
![](https://cdn-images-1.medium.com/max/1600/1*xSFqM6qaDY6f5d5hddhALw.png)
Using dnsrecon to brute force subdomain names
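As a sketch of the dictionary approach shown above, assuming dnsrecon is installed and a subdomains.txt wordlist is at hand (both are assumptions):
```
# Brute force subdomain names against the target's DNS servers
dnsrecon -d example.org -D subdomains.txt -t brt
```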
10. Look at the HTTP status codes and response headers returned for different kinds of requests: a valid page, a non-existent page, a page that redirects, a directory name and so on. Look out for subtle typos, extra spaces and redundant values in the response headers.
![](https://cdn-images-1.medium.com/max/1600/1*H8zBQkiUGzTud1RRTRTKNA.png)
A very subtly broken X-Frame-Options header. The extra space at the beginning of the header nullifies the header itself.
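curl makes these comparisons quick. A sketch of the kinds of requests worth contrasting (the paths are illustrative):
```
# Compare status codes and headers across different kinds of requests
curl -sI https://www.example.org/                 # a valid page
curl -sI https://www.example.org/does-not-exist   # a non-existent page
curl -sI https://www.example.org/admin            # a directory name
curl -sIL https://www.example.org/old-page        # follow redirects with -L
```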
Also, look out for [CSP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP). These list the domains and sources from which script loading is allowed. Sometimes a typo in a domain name listed in a CSP header, or an [insecure JavaScript-hosting CDN, may be your only way of executing an XSS payload :)](https://github.com/cure53/XSSChallengeWiki/wiki/H5SC-Minichallenge-3:-%22Sh*t,-it%27s-CSP!%22)
11. Search the client’s domain name through Shodan and Censys to find files, IP addresses, exposed services and error messages. The good folks at [Shodan](https://www.shodan.io) and [Censys](https://censys.io/) have painstakingly port scanned the Internet, enumerated services and categorised their findings, making them searchable with simple keywords. Both services can be used to find a ton of interesting things, including [open cameras](https://media.blackhat.com/us-13/US-13-Heffner-Exploiting-Network-Surveillance-Cameras-Like-A-Hollywood-Hacker-WP.pdf), [Cisco devices](https://www.defcon.org/images/defcon-18/dc-18-presentations/Schearer/DEFCON-18-Schearer-SHODAN.pdf), [hospital facilities management servers](https://www.blackhat.com/docs/asia-14/materials/Rios/Asia-14-Rios-Owning-A-Building-Exploiting-Access-Control-And-Facility-Management.pdf), and [weakly configured telnet and SNMP services and SCADA systems](https://www.blackhat.com/docs/asia-14/materials/Schloesser/Asia-14-Schloesser-Scan-All-The-Things.pdf). [Censys has been used in the past to find interesting endpoints that hosted source code and entire Docker images of complete apps.](https://avicoder.me/2016/07/22/Twitter-Vine-Source-code-dump/)
![](https://cdn-images-1.medium.com/max/1600/1*FQOpNMfAb7A36nD0A72U_A.png)
shodan can be used to find interesting files and devices
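Both services have web interfaces, but Shodan also ships a command-line client that supports the same search filters. A minimal sketch (you would supply your own API key):
```
# Install and initialise the Shodan CLI
pip install shodan
shodan init YOUR_API_KEY

# Search for services associated with the target's hostname
shodan search hostname:example.org

# Summarise everything Shodan knows about a specific IP
shodan host 8.8.8.8
```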
12. Look up the client on code hosting services like GitHub, GitLab, Bitbucket etc. All sorts of interesting things can be found in code hosted in searchable online repositories, [including web vulnerabilities, 0-days in web apps and configuration issues](http://michenriksen.com/blog/gitrob-putting-the-open-source-in-osint/), as well as [AWS and other secret keys](https://www.itnews.com.au/news/aws-urges-developers-to-scrub-github-of-secret-keys-375785).
[Developers often commit code with production passwords or API access keys](https://news.ycombinator.com/item?id=7411927), only to realise this later, remove the sensitive information and make additional commits. However, by walking the commit log and checking out specific commits, one can retrieve these sensitive pieces of information and use them to launch a full attack on the client’s hosted infrastructure.
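Git itself makes this easy once a repository is cloned: the pickaxe option of git log finds commits that added or removed a given string. A sketch of hunting for a removed secret (the repository URL and search string are illustrative):
```
# Clone a repository belonging to the target organisation
git clone https://github.com/example-org/example-app.git
cd example-app

# Find commits across all branches that added or removed the string "password"
git log -p -S password --all

# Check out a specific commit to recover a file as it existed back then
git checkout <commit-hash>
```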
Tools like [Gitrob](https://github.com/michenriksen/gitrob) can be used to query GitHub from the command line and search for sensitive files across a specific organisation’s repositories.
13. Browse the site’s HTML source to identify whether the client hosts any static content in the cloud. Content like images, JS and CSS files may be hosted in S3 buckets owned by the client. It may also be possible, while performing standard reconnaissance, to identify whether the client uses cloud infrastructure to host static or dynamic content. In such cases, finding the buckets the client uses can be really rewarding if the bucket permissions are misconfigured. [A ton of interesting information can be found in public facing buckets](https://elastic-security.com/2014/07/07/having-a-look-inside-public-amazons-buckets/).
Tools like [DigiNinja’s Bucket Finder](https://digi.ninja/projects/bucket_finder.php) can automate the search by brute forcing bucket names. The tool requires a well-curated list of bucket names and potential full URLs to be effective.
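If a candidate bucket name turns up, the AWS CLI can test whether it is listable without credentials. A sketch with an illustrative bucket name:
```
# Try to list the bucket anonymously; a public bucket returns its contents
aws s3 ls s3://example-org-assets --no-sign-request

# If it is listable, files can be pulled down the same way
aws s3 cp s3://example-org-assets/backup.sql . --no-sign-request
```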
![](https://cdn-images-1.medium.com/max/1600/1*I-UpzxJV6OgLbh4FGgopQg.png)
A private bucket will not disclose files and resources
![](https://cdn-images-1.medium.com/max/1600/1*2hv9j6-GPGR9JuMMkwETYw.png)
A public bucket shows the names of files and resources. These files can then be downloaded using full URLs.
OSINT is an ever-growing field of study in itself. Using the techniques listed above, among others, it is possible to build a profile of the target and reveal several weaknesses, sometimes without even sending a single packet from your system their way.
This brings us to the end of this post. If there are techniques you frequently use that have yielded interesting results and you would like to share them, please do leave a comment.
Until next time, happy hacking!!