I have previously
shown how you can generate your own hash databases and also how to download and
use the NSRL databases from NIST. In this blog post I will describe how you can
most effectively use these hash databases to search for hashes, and the steps
involved in getting there.
Removing duplicate entries
If you have
generated hashes from files located on cloned disk images or on shares with application
installers, the chances are very high that your databases have entries of files
that are named differently but have the same hash. I always try to keep the databases
I generate as small as possible and free from any duplicate entries of
identical files. To show how we can remove these duplicate entries I will use
the Unix commands ‘sort’ and ‘uniq’ on one of the databases I generated in a previous
blog post.
pmedina@forensic:~$ head -5 /tmp/example-rds.txt
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"d9c40dd2f1fb08927e773a0dc70d75fedd71549e","2e54d3fb64ff68607c38ecb482f5fa25","732c6df0","RedDrive.zip",1148549,0,"WIN",""
"d1fa60a19ca5095731eb78cd6f6c7e3eca2cf57c","70e5118a1d0cff1a59b820a919b15949","c224e555","install.msi",1413120,0,"WIN",""
"c9220e529ad1ca74d6d7b4a8a17529e326f617cf","a6ae9f9e02477526bbac1e97357141be","8b1ddae5","!_Columns",3328,0,"WIN",""
"86d90e616458eea7188605b9f601c6bb7b46aeaf","a76026797c61d04c1c9990366e48208e","3a0a30ce","!EventMapping",208,0,"WIN",""
pmedina@forensic:~$ head -1 /tmp/example-rds.txt > /tmp/example-rds-unique.txt
pmedina@forensic:~$ sort -n -k1,1 /tmp/example-rds.txt | uniq -w 42 >> /tmp/example-rds-unique.txt
pmedina@forensic:~$ wc -l /tmp/example-rds*
2213 /tmp/example-rds.txt
1876 /tmp/example-rds-unique.txt
4089 total
pmedina@forensic:~$
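As a side note, the same result can be approximated in a single pass without sorting. The line below is my own addition, not from the original post; it assumes the RDS layout shown above, with the quoted SHA-1 as the first comma-separated field, and keeps only the first occurrence of every SHA-1, so the header line is preserved automatically.
# one-pass de-duplication on the first (SHA-1) field - hypothetical alternative
awk -F, '!seen[$1]++' /tmp/example-rds.txt > /tmp/example-rds-unique.txt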
Searching for a hash
Our hash
databases are in essence plain ASCII text files, and the classic way to
search a text file is with the Unix command ‘grep’. This command is very
versatile and can be used in many different ways, including searching for
hashes in a hash database. In the ‘grep’ command below I am searching for an MD5
checksum in the file ‘NSRLFile.txt’ from the “minimal” hash set, one of the
reduced RDS files I have downloaded from NIST.
pmedina@forensic:~/NSRL/RDS_236m$ time grep ",\"91D2AB5FDB8A07239B9ABB1945A167B7\"," NSRLFile.txt
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""
real 0m52.153s
user 0m0.332s
sys 0m5.144s
The hash I was looking for is listed in the last entry of the database, and it took 52 seconds on my system for ‘grep’ to find this row. The ‘grep’ command performs a linear scan of the file using the Boyer–Moore string search algorithm and is in general a very efficient way to search for data in a file. For our needs, however, we have to speed up the search.
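Before turning to an index, it is worth noting that plain ‘grep’ can often be nudged a bit further. The line below is my own hedged sketch, not part of the original measurement: ‘-F’ treats the pattern as a fixed string, ‘-m 1’ stops after the first match, and forcing the C locale avoids multi-byte character handling; the actual gain will vary with your system.
# fixed-string search, stop at the first hit, byte-oriented locale
LC_ALL=C grep -F -m 1 ',"91D2AB5FDB8A07239B9ABB1945A167B7",' NSRLFile.txt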
Indexing a hash database
A more
efficient way to search for a hash in a large database is to create an index
file and use a binary search algorithm to find the hash we are looking for. The
program ‘hfind’, included in The Sleuth Kit by Brian Carrier, does exactly
that. It lets us create an index file of the hash database, which the
program will later use to perform a binary search for the hash.
pmedina@forensic:~/NSRL/RDS_236m$ time hfind -i nsrl-md5 NSRLFile.txt
Index Created
real 11m57.155s
user 2m56.219s
sys 0m47.571s
pmedina@forensic:~/NSRL/RDS_236m$ ls NSRLFile.txt*
Creating the index will take some time, but after ‘hfind’ has finished processing the database, the index file can be found in the same directory as the hash database you indexed. The index file has the same name as your hash database, with the hash algorithm it uses appended to it along with the file extension ‘.idx’. In our case the file is named ‘NSRLFile.txt-md5.idx’.
Content of the index file
The
actual content of the index file is just the hash value for the algorithm we are
using, followed by the offset in the hash database where more information about
that entry can be retrieved. The hashes are sorted so that the binary search
algorithm will work.
pmedina@forensic:~/NSRL/RDS_236m$ head NSRLFile.txt-md5.idx
00000000000000000000000000000000000000000|nsrl
00000040F69913A27FF7401B8BF3CFD1|0000002821265904
00000238B43AFAF52EB6F9780D25173C|0000001002881130
000003160F7A0B3987C913B62902E379|0000000202283901
000003A20F4478A192448F93094B1984|0000002866111867
00000422D47441DBEF718394532CDD7A|0000002254621103
0000042B5591608E21A79651FC406B0F|0000003123859956
00000448C638A2AE7C4CB08960A627D8|0000001245035136
000005714BAD67F47BE661698262532D|0000002759161303
000005D643791EEABB5BB8AE6BB902EE|0000000126929587
Looking at the index file and searching for the same hash we searched for above, we see that the entry in the hash database that holds more information regarding the file is located at offset 3339032584.
pmedina@forensic:~/NSRL/RDS_236m$ grep 91D2AB5FDB8A07239B9ABB1945A167B7 NSRLFile.txt-md5.idx
91D2AB5FDB8A07239B9ABB1945A167B7|0000003339032584
Skipping to that offset in our hash database will take us exactly to the entry in the database that holds the information regarding the file we are looking for.
pmedina@forensic:~/NSRL/RDS_236m$ tail -c +0000003339032584 NSRLFile.txt
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""
Size of the index file
The size
of the index file depends on the format of your hash database and the hash
algorithm you use to create the index. In our case we created an MD5 index of a
hash database that uses the RDS format. To calculate the size of the index
file, take the number of lines in your hash database, subtract the one line
that contains the header, and multiply that value by the number of characters
each index line holds. In our case that value is 50: 32 characters for the MD5
hash, one separator character, the 16-character offset value and a line
separator. Finally we add 47 bytes for the index file’s own header line, the
‘…|nsrl’ line shown above.
pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ echo "25892925 -1; . * 50; . + 47" | bc
25892924
1294646200
1294646247
pmedina@forensic:~/NSRL/RDS_236m$ wc -c NSRLFile.txt-md5.idx
1294646247 NSRLFile.txt-md5.idx
The size
will be different if you use a SHA-1 hash for your index or a database format
other than RDS.
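As a rough back-of-the-envelope figure for the SHA-1 case, and assuming the index keeps the same 16-character offset field and line layout (an assumption on my part, not verified against an actual SHA-1 index), each entry would be 40 + 1 + 16 + 1 = 58 bytes:
# hypothetical size of a SHA-1 index over the same database, ignoring the index header line
echo "(25892925 - 1) * 58" | bc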
Using the index
Searching
the hash database with the help of the index file is a lot faster than using
the ‘grep’ command demonstrated above. I will again search for the same MD5
hash as before, this time using ‘hfind’ and the index I created.
pmedina@forensic:~/NSRL/RDS_236m$ time hfind NSRLFile.txt 91d2ab5FDB8A07239B9ABB1945A167b7
91d2ab5FDB8A07239B9ABB1945A167b7 sharemfcRes.dll
real 0m0.051s
user 0m0.000s
sys 0m0.004s
The
search completed in less than one second and printed the name of the file that
is associated with the hash. If you look carefully at the command above, you will see that I mixed upper and lower case characters, which 'hfind' handles perfectly. Under most circumstances, the fact that the hash matched
an entry in the database is all you need to know to either discard the file or
alert on its presence. There are, however, situations where I would like to know more
information, like the full path to the file.
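When there are many hashes to check, ‘hfind’ does not have to be invoked once per hash. If I remember the Sleuth Kit options correctly, it can read the hashes from a file with the ‘-f’ flag; treat the line below as a hedged sketch and check the man page of your version, where ‘hashes.txt’ is a hypothetical file with one hash per line.
# look up a whole file of hashes against the indexed database
hfind -f hashes.txt NSRLFile.txt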
Using a hash reference database
The way
I usually go about generating databases with hashes I want to search for is
to generate one hash database in the RDS format as well as another one with just
the MD5 hashes of the files. The hash database that uses the RDS format
only includes the filename, while the MD5 hash database has the full path to the
file that was processed. In most computer forensic investigations, having
the full path to the file that is being matched might not be so useful - the
exact location of the file you are investigating will most likely not match the
one you have in your hash database. Sometimes, however, I have found that there
is a need to know the exact path to the file that produced the hash. This is
the reason I also generate a hash database that has the full path to the files:
the hash reference database.
In a
previous blog post I used hashdog to generate two hash databases from the same
set of files. The first database uses the RDS format; it is the same database
that we removed the duplicates from above. The second one is an MD5 hash
database that was generated with the option ‘--md5-fullpath’ enabled. This is
my hash reference database, which includes the full path to the files that were
processed. After creating an index of the hash database in the RDS format, I
use that database to search for the MD5 hash of a file I have on disk.
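That indexing step is not shown in the session below, but it is the same ‘hfind -i’ invocation used earlier on the NSRL file, pointed at the de-duplicated example database (a small sketch, assuming the database follows the same RDS layout):
# create the MD5 index for the example RDS database
hfind -i nsrl-md5 /tmp/example-rds-unique.txt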
pmedina@forensic:~$ hfind /tmp/example-rds-unique.txt ae182dc797cd9ad2c025066692fc041b
ae182dc797cd9ad2c025066692fc041b System.dll
As you
can see, I get a match for my hash and the name of the file is returned. If we
want to know exactly where the checksum was generated from, we can simply grep
for the same checksum in our reference database.
pmedina@forensic:~$ grep ae182dc797cd9ad2c025066692fc041b /tmp/example.txt
ae182dc797cd9ad2c025066692fc041b Firefox Setup 13.0.1.exe/core/uninstall/helper.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b Firefox Setup 13.0.1.exe/core/maintenanceservice_installer.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll
The
location of the file that produced the hash is presented to us,
and by looking at the result it is clear that the file in question is part of
the Firefox application. If you want, you can also remove all the duplicates
from the reference database as we previously did for the RDS database above. Personally,
I like to keep the hash reference database in its original format, including duplicate
entries of the same file. This is the only reference I have if I ever want to
trace back the origin of the file I am analyzing.
Compressing a hash database
Using an
indexed hash database is a very fast and efficient way of searching a
large database for a hash value. Something to consider when indexing a
database, however, is size. On top of having a large database that contains the hashes
and other information about the files, the index file itself will also add to
the disk space that is being used.
The
largest hash database I have on disk is my hash reference database - the database
that includes the full path to the files. This is a database I do not use
frequently, so instead of creating an index file that would occupy even more disk
space, I compress this database to save space. I will illustrate this
below using the same database we grepped the MD5 hash from above and
the Unix command ‘gzip’.
pmedina@forensic:~/NSRL/RDS_236m$ gzip -c NSRLFile.txt > /tmp/NSRLFILE.txt.gz
pmedina@forensic:~/NSRL/RDS_236m$ gzip -l /tmp/NSRLFILE.txt.gz
compressed uncompressed ratio uncompressed_name
1784524236 3339032716 46.6% /tmp/NSRLFILE.txt
As you
can see from the command output above, compressing the database reduced its
size by almost half.
Searching in a compressed file
Now that
the database is compressed we cannot use the regular ‘grep’ command to search
it. Instead we have to use a command like ‘zgrep’, which decompresses
the file on the fly before running ‘grep’ on it.
pmedina@forensic:~/NSRL/RDS_236m$ time zgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""
real 1m4.639s
user 0m40.735s
sys 0m10.917s
This
command takes some ten seconds longer to complete than searching through the
uncompressed database. In most cases, the disk space you gain is worth
the increase in the time it takes to search through the file.
Optimizing decompression
The
program ‘gzip’ was written almost 20 years ago, and there are now other
programs that are better suited to today's computer hardware. One replacement
for ‘gzip’ is ‘pigz’, a parallel implementation of gzip. This
program supports parallel execution in multiple threads and makes better use of
the available CPU power on multicore or multiprocessor systems than the standard
gzip implementation. By replacing ‘gzip’ with ‘pigz’ in the ‘zgrep’ shell script,
we can reduce the time it takes to search through the compressed hash
database.
pmedina@forensic:~/NSRL/RDS_236m$ diff /bin/zgrep /bin/pigzgrep
170c170
< (gzip -cdfq -- "$i" 5>&-; echo $? >&5) 3>&- |
---
> (pigz -cdfq "$i" 5>&-; echo $? >&5) 3>&- |
pmedina@forensic:~/NSRL/RDS_236m$ time pigzgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""
real 0m54.983s
user 0m29.658s
sys 0m10.085s
By
switching to ‘pigz’, searching the compressed hash database only takes a few
seconds longer than searching the uncompressed file. For
the way I use these hash reference databases, those extra seconds are
acceptable, especially considering the disk space being saved.
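Since ‘pigz’ understands the same basic flags as ‘gzip’, the compression step itself can also be run in parallel. This is a hedged sketch of my own rather than something measured in this post; the resulting file is an ordinary gzip stream that ‘zgrep’ or ‘pigzgrep’ can read.
# compress with all available cores into a gzip-compatible archive
pigz -c NSRLFile.txt > /tmp/NSRLFILE.txt.gz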