3 main techniques for automating document search and analysis on a company’s website

cyb_detective · OSINT Ambition · Apr 11, 2024

Documents stored on a company’s website can be one of the most useful sources of information for an investigation. They may include emails, phone numbers, addresses, employee names, links to other company-related sites, and inadvertently published financial and strategic information.

One of the most popular methods of searching for such documents is Google Dorks.

A few examples:

site:company.com inurl:fileadmin

site:company.com (filetype:pdf OR filetype:ppt OR filetype:xls)

site:company.com (contract OR “internal use only”) filetype:pdf

You can find more examples in Christina Lekati’s article OSINT Techniques for Sensitive Documents That Have Escaped Into The Clear Web.

The biggest disadvantage of this method is that Google does NOT index all documents.

Some of them may be excluded from indexing by noindex or nofollow directives, some are simply not linked from any page of the site, and some are so new that Google has not yet had time to index them.

In addition, some investigators use only Google and a browser to search for documents, but it is very difficult to analyse a large volume of documents that way.

This quick article will teach you how to find files that may not be indexed by Google and automatically analyse them.

Find directories

Install Katana (fast web crawler):

go install github.com/projectdiscovery/katana/cmd/katana@latest

Get a list of the website’s URLs:

katana -u owasp.org -o links.txt

Extract the root directories from the link list:

cat links.txt | grep -oP '^https?://(?:[^/]*/){2}' | sort -u | tee root-dirs.txt
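
For example (illustrative entries only, the actual list will depend on the site), root-dirs.txt might contain lines such as:

https://owasp.org/corporate/
https://owasp.org/events/
https://owasp.org/projects/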

This list is enough for us to move on to the next step.

But remember that for a deeper investigation you can try to find other existing directories with DirHunter and GoBuster, as well as directories containing now-deleted files with WayMore and WayBackUrls (see the sketch below).
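
A minimal sketch of how GoBuster and WayBackUrls might be run (the wordlist path and output file names are assumptions, adjust them for your target):

gobuster dir -u https://owasp.org -w /usr/share/wordlists/dirb/common.txt -o gobuster-dirs.txt   # wordlist path is an example, use your own

echo owasp.org | waybackurls > wayback-links.txt   # lists archived URLs known to the Wayback Machine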

Download files

Now let’s pick a directory from the list and try to download its files into an “owasp” folder:

wget -r --no-parent owasp.org/corporate -P owasp

Be prepared for this to take some time, depending on the size of the site you want to explore.

Once downloaded, you can do whatever you want with the files from the chosen site directory (use Grep, PDFgrep, Find, Fimages, Exiftool and many other tools).
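
For example, a few quick checks on the downloaded “owasp” directory (the keywords here are only placeholders, substitute your own):

grep -ril "confidential" owasp/   # "confidential" is a placeholder keyword

pdfgrep -ri "internal use only" owasp/   # searches inside PDF text, keyword is a placeholder

exiftool -r owasp/ | grep -E '^(Author|Creator)'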

But unfortunately, site directories often contain so many files that downloading them all is simply not feasible (and yes, owasp.org is not a great example here, try some other site).

But you can just get a list of links to files in a particular directory and save it to a text file:


wget --spider -r --no-parent info.lidl/de -v -o lidl_links_spider.txt

From this text file, you can select the most interesting links (for example, those with “pdf” in the path) and download them.

Extract all links from the output file:

grep -o 'http[s]\?://[^ ]\+' lidl_links_spider.txt >lidl_links.txt

Filter links by keyword:

grep -E 'pdf' lidl_links.txt >lidl_pdf_links.txt

Download files:

wget -i lidl_pdf_links.txt -P lidl_pdf

Note that in this example we run into a common problem: the downloaded files do not have the appropriate extension in their names. Let’s add it to all of them at once:

find lidl_pdf -type f -exec mv '{}' '{}'.pdf \;

Now try to open one of the PDF files.
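
If some downloads turn out not to be PDFs at all, the file utility can show the actual type of each file:

file lidl_pdf/* | grep -v 'PDF document'   # lists files that are not actually PDF documents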

Files with other extensions can be found and downloaded in the same way. For exploring sensitive info, start with docx, xlsx, pptx, csv, etc.
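
For example, a minimal sketch of the same pipeline for common office formats (file and directory names are just placeholders):

grep -Ei '\.(docx?|xlsx?|pptx?|csv)' lidl_links.txt > lidl_office_links.txt   # filter links by office-format extensions

wget -i lidl_office_links.txt -P lidl_office   # download them into a "lidl_office" folder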

You can also try just downloading all the PDFs at once using this command:

wget -r -A .pdf -e robots=off -P pdf_dir cerambycidae.net

But unfortunately, it doesn’t always work, and it doesn’t work for every site.
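
When it does work, the -A option also accepts a comma-separated list of suffixes, so several formats can be grabbed in one pass (a sketch; company.com and docs_dir are just placeholders):

wget -r -A "pdf,docx,xlsx,pptx" -e robots=off -P docs_dir company.com   # domain and output folder are placeholders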

Search for sensitive info

Now let’s look at the metadata of the retrieved documents.

exiftool lidl_pdf | grep ^Creator
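
To compare metadata across many files at once, exiftool can also export selected tags to CSV (a sketch; pick whichever tags interest you):

exiftool -csv -Author -Creator -Producer -CreateDate lidl_pdf > lidl_metadata.csv   # tag selection is just an example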

Similarly, you can extract images and text from PDFs and search that text with keywords or regular expressions, as in the sketch below.
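
A minimal sketch, assuming pdftotext from poppler-utils is installed: dump the text of every PDF and pull out anything that looks like an email address.

for f in lidl_pdf/*.pdf; do pdftotext "$f" -; done | grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' | sort -u   # the email regex is a simple illustration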

I regularly make posts about automating file analysis. You can find basic information in my course Linux for OSINT. A 21-day course for beginners.

I also recommend this article:

8 basic methods of automating the collection of information from company websites

Thank you very much for your attention, dear readers! Thank you for staying with me!
