Tuesday, July 10, 2012

Searching Hash Databases

I have previously showed you can generate your own hash databases and also how to download and use the NSRL databases from NIST. In this blog post I will describe how you most effectively can use these hash databases to search for hashes and the steps involved in getting there.

Removing duplicate entries
If you have generated hashes from files located on disk cloning images or from shares with application installers, the chances are very high that your databases has entries of files that are named differently but has the same hash. I always try to keep the databases I generate as small as possible and free from any duplicate entries of identical files. To show how we can remove these duplicate entries I will use the Unix commands ‘sort’ and ‘uniq’ on one of the databases I generated in a previous blog post.

pmedina@forensic:~$ head -5 /tmp/example-rds.txt
pmedina@forensic:~$ head -1 /tmp/example-rds.txt > /tmp/example-rds-unique.txt
pmedina@forensic:~$ sort -n -k1,1 /tmp/example-rds.txt | uniq -w 42 >> /tmp/example-rds-unique.txt
pmedina@forensic:~$ wc -l /tmp/example-rds*
  2213 /tmp/example-rds.txt
  1876 /tmp/example-rds-unique.txt
  4089 total

As you can see above, the hash database I removed the duplicate entries from is using the RDS format. Therefore, the commands above to will only work on a hash database that uses that format. By sorting the file and only keeping the lines that had a unique SHA-1 checksum, I managed to get rid of 337 entries in the database. This might not seem like a big gain at first but keep in mind that this database is a very small database that was generated from just a few files. Having databases with unique entries will save your disk space and even speed up your searching.

Searching for a hash
Our hash databases are in essence files using the ASCII text format and the classic way to search a text file is by using the Unix command ‘grep’. This command is very versatile and can be used in many different ways, including searching for hashes in a hash database. The ‘grep’ command below is me searching for a MD5 checksum in the file ‘NSRLFile.txt’ from the “minimal” hashset I have downloaded from one of the reduced RDS files from NIST.

pmedina@forensic:~/NSRL/RDS_236m$ time grep ",\"91D2AB5FDB8A07239B9ABB1945A167B7\"," NSRLFile.txt

real    0m52.153s
user    0m0.332s
sys     0m5.144s

The hash I was looking for is listed in the last entry in the database and it took 52 seconds on my system for ‘grep’ to find this row. The command ‘grep’ does a linear search using the Boyer–Moore string search algorithm and is in general a very efficient way to search for data in a file. For our needs however, we do have to try to speed up the search.

Indexing a hash database
A more efficient way to search for a hash in a large database is to create an index file and use a binary search algorithm to find the hash we are looking for. The program ‘hfind’ that is included in The Sleuthkit by Brian Carrier does exactly just that. It lets us create an index file of the hash database which the program will later use to perform a binary search for the hash.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind -i nsrl-md5 NSRLFile.txt
Index Created

real    11m57.155s
user    2m56.219s
sys     0m47.571s
pmedina@forensic:~/NSRL/RDS_236m$ ls NSRLFile.txt*

Creating the index will take some time but after ‘hfind’ has finished processing the database, the index file can be located in the same directory as the hash database you created the index for. This index file is named the same as your hash database but has the hash algorithm it is using appended to it as well as the file extension of ‘.idx’. In our case the file is named ‘NSRLFile.txt-md5.idx’.

Content of the index file
The actual content of the index file is just the hash of the algorithm we are using followed by an offset to where in the hash database more information can be retrieved. The hashes are sorted so that the binary search algorithm will work.

pmedina@forensic:~/NSRL/RDS_236m$ head NSRLFile.txt-md5.idx

Looking at the index file and searching for the same hash we searched for above, we see that the entry in the hash database that holds more information regarding the file is located at offset 3339032584.

pmedina@forensic:~/NSRL/RDS_236m$ grep 91D2AB5FDB8A07239B9ABB1945A167B7 NSRLFile.txt-md5.idx

Skipping to that offset in our hash database will take us exactly to the entry in the database that holds the information regarding the file we are looking for.

pmedina@forensic:~/NSRL/RDS_236m$ tail -c +0000003339032584 NSRLFile.txt

Size of the index file
The size of the index file will depend on the format of your hash database and the hash algorithm you use to create the index. In our case we created a MD5 index of a hash database that is using the RDS format. If we want to calculate the size of the index file, all we need to do is to take the number of lines in your hash database, remove 1 line that contains the header and multiply that value with the number of characters each line hold. In our case that value will be 50, 32 characters for the MD5 hash plus one character that separates the 16 character offset value that is followed by a line separator.

pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ echo "25892925 -1; . * 50; . + 47" | bc
pmedina@forensic:~/NSRL/RDS_236m$ wc -c NSRLFile.txt-md5.idx
1294646247 NSRLFile.txt-md5.idx

The size will be different if you are using a SHA-1 hash as your index or another database format than the RDS.

Using the index
Searching through the hash database with the use of the index file is a lot faster than using the ‘grep’ command demonstrated above. I will again search for the same MD5 hash as I searched for before, this time using ‘hfind’ and the index I created.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind NSRLFile.txt 91d2ab5FDB8A07239B9ABB1945A167b7
91d2ab5FDB8A07239B9ABB1945A167b7        sharemfcRes.dll

real    0m0.051s
user    0m0.000s
sys     0m0.004s

The search completed in less than one second and printed the name of the file that is associated with the hash. If you look carefully at the command above you will see that I mixed both upper and lower cases which 'hfind' can handle perfectly. Under most circumstances the fact the hash matched an entry in the database is all you need to know to either discard the file or alert on its presence. There is however situations where I would like to know more information, like the full path to the file.

Using a hash reference database
They way I usually go about when generating databases with hashes I want to search for, is to generate one hash database in the RDS format as well as another one with just the MD5 hashes of the files. The hash database that is using the RDS format only includes the filename while the MD5 hash database has the full path to the file that was processed. In most computer forensic investigation cases, having the full path to the file that is being matched might not be so useful - the exact location of the file you are investigating will most likely not match the one you have in your hash database. Sometimes however, I have found that there is a need to know the exact path to the file that produced the hash. This is the reason I also generate a hash database that has the full path to the files, the hash reference database.
In a previous blog post I used hashdog to generate two hash databases from the same set of files. The first database I generated is using the RDS format, the same database that we removed the duplicates from above. The second one is a MD5 hash database that was generated with the option ‘--md5-fullpath’ enabled. This is my hash reference database that includes the full path to the files that were processed. After created an index of the hash database in the RDS format, I am using this database to search for a MD5 hash of a file I have on disk.

pmedina@forensic:~$ hfind /tmp/example-rds-unique.txt ae182dc797cd9ad2c025066692fc041b
ae182dc797cd9ad2c025066692fc041b        System.dll

As you can see, I get a match for my hash and the name of the file is returned to us. If we want to know from were exactly the checksum was generated from, we can simply grep for the same checksum in our reference database.

pmedina@forensic:~$ grep ae182dc797cd9ad2c025066692fc041b /tmp/example.txt
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/uninstall/helper.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/maintenanceservice_installer.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll

The location of the file that produced the hash is presented to us and by looking at the result it is clear that the file in question is part of the Firefox application. If you want you can also remove all the duplicates from the reference database as we previously did for the RDS database above. Personally, I like to keep the hash reference database in its original format, including duplicate entries of the same file. This is the only reference I have if I ever want to trace back the origin of the file I am analyzing.

Compressing a hash database
Using an indexed hash database is a very fast and efficient way of searching through a large database for a hash value. However, something to consider when indexing a database is size. On top of having a large database that contains the hashes and other information about the file, the index file itself will also add to the disk space that is being used.
The largest hash database I have on disk is my hash reference database - the database that includes the full path to the files. This is a database I do not frequently use so instead of creating an index file that will occupy even more disk space, I will actually compress this database to save disk space. I will illustrate this below by using the same database we used to ‘grep’ the MD5 hash in above and the Unix command ‘gzip’.

pmedina@forensic:~/NSRL/RDS_236m$ gzip -c NSRLFile.txt > /tmp/NSRLFILE.txt.gz
pmedina@forensic:~/NSRL/RDS_236m$ gzip -l /tmp/NSRLFILE.txt.gz
compressed        uncompressed  ratio uncompressed_name
1784524236          3339032716  46.6% /tmp/NSRLFILE.txt

As you can see from the command output above, by compressing the database we managed to reduce its size to almost half its original size.

Searching in a compressed file
Now that the database is compressed we cannot use the regular ‘grep’ command to search the file. Instead we have to use a command like ‘zgrep’ that will uncompress the file on the fly before using ‘grep’.

pmedina@forensic:~/NSRL/RDS_236m$ time zgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz

real    1m4.639s
user    0m40.735s
sys     0m10.917s

This command takes a few seconds more to complete than searching through the uncompressed database. In most cases, the gain you get in disk space is worth the increase in processing time, the time it takes to search through the file.

Optimizing decompression
The program ‘gzip’ was written almost 20 years ago and today there are other programs that are better suited for the type of computer hardware we have today. One replacement program for ‘gzip’ we can use is ‘pigz, parallel implementation of gzip. This program supports parallel execution in multiple threads and utilizes more available CPU power on multicore or multiprocessor systems than the standard gzip implementation. By replacing ‘gzip’ with ‘pigz’ in the ‘zgrep’ shell scrip we can reduce the time it takes for us to search through the compressed hash database.

pmedina@forensic:~/NSRL/RDS_236m$ diff /bin/zgrep /bin/pigzgrep
<     (gzip -cdfq -- "$i" 5>&-; echo $? >&5) 3>&- |
>     (pigz -cdfq "$i" 5>&-; echo $? >&5) 3>&- |
pmedina@forensic:~/NSRL/RDS_236m$ time pigzgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz

real    0m54.983s
user    0m29.658s
sys     0m10.085s

By switching to ‘pigz’ the time to process our compressed hash database only increased by a few seconds compared to processing the uncompressed file. For the way I am using these hash reference databases, these extra seconds are acceptable, especially considering the disk space that is being saved.

No comments:

Post a Comment