Tuesday, July 10, 2012

Searching Hash Databases


I have previously shown how you can generate your own hash databases and also how to download and use the NSRL databases from NIST. In this blog post I will describe how you can most effectively use these hash databases to search for hashes, and the steps involved in getting there.

Removing duplicate entries
If you have generated hashes from files located on disk clone images or from shares with application installers, the chances are very high that your databases have entries of files that are named differently but have the same hash. I always try to keep the databases I generate as small as possible and free from duplicate entries of identical files. To show how we can remove these duplicate entries, I will use the Unix commands ‘sort’ and ‘uniq’ on one of the databases I generated in a previous blog post.

pmedina@forensic:~$ head -5 /tmp/example-rds.txt
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"d9c40dd2f1fb08927e773a0dc70d75fedd71549e","2e54d3fb64ff68607c38ecb482f5fa25","732c6df0","RedDrive.zip",1148549,0,"WIN",""
"d1fa60a19ca5095731eb78cd6f6c7e3eca2cf57c","70e5118a1d0cff1a59b820a919b15949","c224e555","install.msi",1413120,0,"WIN",""
"c9220e529ad1ca74d6d7b4a8a17529e326f617cf","a6ae9f9e02477526bbac1e97357141be","8b1ddae5","!_Columns",3328,0,"WIN",""
"86d90e616458eea7188605b9f601c6bb7b46aeaf","a76026797c61d04c1c9990366e48208e","3a0a30ce","!EventMapping",208,0,"WIN",""
pmedina@forensic:~$ head -1 /tmp/example-rds.txt > /tmp/example-rds-unique.txt
pmedina@forensic:~$ sort -n -k1,1 /tmp/example-rds.txt | uniq -w 42 >> /tmp/example-rds-unique.txt
pmedina@forensic:~$ wc -l /tmp/example-rds*
  2213 /tmp/example-rds.txt
  1876 /tmp/example-rds-unique.txt
  4089 total
pmedina@forensic:~$

As you can see above, the hash database I removed the duplicate entries from uses the RDS format, so the commands above will only work on a hash database in that format. By sorting the file and only keeping the lines with a unique SHA-1 checksum, I managed to get rid of 337 entries in the database. This might not seem like a big gain at first, but keep in mind that this is a very small database generated from just a few files. Having databases with unique entries will save disk space and even speed up your searching.
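The same idea works for the plain md5sum-style databases that hashdog can produce. A minimal sketch, assuming the 32-character MD5 hash starts each line and the file has no header row:

# sort the database and keep only the lines whose first 32 characters
# (the MD5 hash) are unique
sort /tmp/example.txt | uniq -w 32 > /tmp/example-unique.txt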

Searching for a hash
Our hash databases are in essence plain ASCII text files, and the classic way to search a text file is with the Unix command ‘grep’. This command is very versatile and can be used in many different ways, including searching for hashes in a hash database. In the ‘grep’ command below I am searching for an MD5 checksum in the file ‘NSRLFile.txt’ from the “minimal” hashset, one of the reduced RDS files I have downloaded from NIST.

pmedina@forensic:~/NSRL/RDS_236m$ time grep ",\"91D2AB5FDB8A07239B9ABB1945A167B7\"," NSRLFile.txt
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""

real    0m52.153s
user    0m0.332s
sys     0m5.144s

The hash I was looking for is the last entry in the database, and it took 52 seconds on my system for ‘grep’ to find the row. The ‘grep’ command does a linear search using the Boyer–Moore string search algorithm and is in general a very efficient way to search for data in a file. For our needs, however, we have to speed up the search.
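Before reaching for an index, it is worth knowing that ‘grep’ itself can be tuned a little. A small sketch of two common tricks on the same search as above:

# -F treats the pattern as a fixed string instead of a regular expression,
# -m 1 stops as soon as the first match is found, and the C locale avoids
# slow multibyte character handling
LC_ALL=C grep -F -m 1 ',"91D2AB5FDB8A07239B9ABB1945A167B7",' NSRLFile.txt

Since our hash happens to be the last entry, ‘-m 1’ does not help in this particular case, but for hashes earlier in the file it avoids scanning the rest of the database.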

Indexing a hash database
A more efficient way to search for a hash in a large database is to create an index file and use a binary search algorithm to find the hash we are looking for. The program ‘hfind’, included in The Sleuth Kit by Brian Carrier, does exactly that. It lets us create an index file of the hash database, which the program later uses to perform a binary search for the hash.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind -i nsrl-md5 NSRLFile.txt
Index Created

real    11m57.155s
user    2m56.219s
sys     0m47.571s
pmedina@forensic:~/NSRL/RDS_236m$ ls NSRLFile.txt*
NSRLFile.txt  NSRLFile.txt-md5.idx

Creating the index takes some time, but after ‘hfind’ has finished processing the database, the index file can be found in the same directory as the hash database it was created for. The index file is named after the hash database, with the hash algorithm used and the file extension ‘.idx’ appended. In our case the file is named ‘NSRLFile.txt-md5.idx’.

Content of the index file
Each line of the index file is just the hash in the algorithm we are using, followed by the byte offset in the hash database where the full entry can be retrieved. The hashes are sorted so that the binary search algorithm will work.

pmedina@forensic:~/NSRL/RDS_236m$ head NSRLFile.txt-md5.idx
00000000000000000000000000000000000000000|nsrl
00000040F69913A27FF7401B8BF3CFD1|0000002821265904
00000238B43AFAF52EB6F9780D25173C|0000001002881130
000003160F7A0B3987C913B62902E379|0000000202283901
000003A20F4478A192448F93094B1984|0000002866111867
00000422D47441DBEF718394532CDD7A|0000002254621103
0000042B5591608E21A79651FC406B0F|0000003123859956
00000448C638A2AE7C4CB08960A627D8|0000001245035136
000005714BAD67F47BE661698262532D|0000002759161303
000005D643791EEABB5BB8AE6BB902EE|0000000126929587

Looking at the index file and searching for the same hash we searched for above, we see that the entry in the hash database that holds more information regarding the file is located at offset 3339032584.

pmedina@forensic:~/NSRL/RDS_236m$ grep 91D2AB5FDB8A07239B9ABB1945A167B7 NSRLFile.txt-md5.idx
91D2AB5FDB8A07239B9ABB1945A167B7|0000003339032584

Skipping to that offset in our hash database will take us exactly to the entry in the database that holds the information regarding the file we are looking for.

pmedina@forensic:~/NSRL/RDS_236m$ tail -c +0000003339032584 NSRLFile.txt
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""

Size of the index file
The size of the index file depends on the format of your hash database and the hash algorithm you use to create the index. In our case we created an MD5 index of a hash database in the RDS format. To calculate the size of the index file, take the number of lines in the hash database, subtract the header line, and multiply by the number of bytes each index line holds. In our case that is 50 bytes per line: 32 characters for the MD5 hash, one separator character, the 16-character offset value and a line separator. Finally, add 47 bytes for the index file's own header line.

pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ echo "25892925 -1; . * 50; . + 47" | bc
25892924
1294646200
1294646247
pmedina@forensic:~/NSRL/RDS_236m$ wc -c NSRLFile.txt-md5.idx
1294646247 NSRLFile.txt-md5.idx

The size will be different if you are using a SHA-1 hash as your index or a database format other than RDS.
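The ‘bc’ expression above relies on ‘.’ being a GNU bc shorthand for the last printed result. If you find that hard to read, the same calculation can be written with plain shell arithmetic:

# (lines - header) * 50 bytes per index line + 47 bytes for the index header line
LINES=$(wc -l < NSRLFile.txt)
echo $(( (LINES - 1) * 50 + 47 ))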

Using the index
Searching through the hash database with the help of the index file is a lot faster than using the ‘grep’ command demonstrated above. I will again search for the same MD5 hash as before, this time using ‘hfind’ and the index I created.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind NSRLFile.txt 91d2ab5FDB8A07239B9ABB1945A167b7
91d2ab5FDB8A07239B9ABB1945A167b7        sharemfcRes.dll

real    0m0.051s
user    0m0.000s
sys     0m0.004s

The search completed in a fraction of a second and printed the name of the file associated with the hash. If you look carefully at the command above you will see that I mixed upper and lower case, which ‘hfind’ handles perfectly. Under most circumstances, the fact that the hash matched an entry in the database is all you need to know, either to discard the file or to alert on its presence. There are, however, situations where I would like to know more, such as the full path to the file.
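‘hfind’ can also look up many hashes in one invocation with the ‘-f’ switch, which takes a file with one hash per line. A sketch of how a whole directory could be checked at once, with ‘/tmp/lookup.txt’ being a hypothetical scratch file:

# hash every file in the directory and look all the hashes up in one run
md5sum /files/* | awk '{print $1}' > /tmp/lookup.txt
hfind -f /tmp/lookup.txt NSRLFile.txt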

Using a hash reference database
The way I usually go about generating databases with hashes I want to search for is to generate one hash database in the RDS format as well as another with just the MD5 hashes of the files. The hash database in the RDS format only includes the filename, while the MD5 hash database has the full path to the file that was processed. In most computer forensic investigations, having the full path to the file being matched might not be so useful - the exact location of the file you are investigating will most likely not match the one you have in your hash database. Sometimes, however, I have found there is a need to know the exact path to the file that produced the hash. This is the reason I also generate a hash database that has the full path to the files, the hash reference database.
In a previous blog post I used hashdog to generate two hash databases from the same set of files. The first database uses the RDS format, the same database we removed the duplicates from above. The second one is an MD5 hash database generated with the option ‘--md5sum-fullpath’ enabled. This is my hash reference database, which includes the full path to the files that were processed. After creating an index of the hash database in the RDS format, I am using that database to search for the MD5 hash of a file I have on disk.

pmedina@forensic:~$ hfind /tmp/example-rds-unique.txt ae182dc797cd9ad2c025066692fc041b
ae182dc797cd9ad2c025066692fc041b        System.dll

As you can see, I get a match for my hash and the name of the file is returned. If we want to know where exactly the checksum came from, we can simply grep for the same checksum in our reference database.

pmedina@forensic:~$ grep ae182dc797cd9ad2c025066692fc041b /tmp/example.txt
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/uninstall/helper.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/maintenanceservice_installer.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll

The location of the file that produced the hash is presented to us, and looking at the result it is clear that the file in question is part of the Firefox application. If you want, you can also remove all the duplicates from the reference database, as we previously did for the RDS database above. Personally, I like to keep the hash reference database in its original form, including duplicate entries of the same file. This is the only reference I have if I ever want to trace back the origin of a file I am analyzing.

Compressing a hash database
Using an indexed hash database is a very fast and efficient way of searching through a large database for a hash value. However, something to consider when indexing a database is size. On top of the large database that contains the hashes and other information about the files, the index file itself will also add to the disk space being used.
The largest hash database I have on disk is my hash reference database - the database that includes the full path to the files. This is a database I do not use frequently, so instead of creating an index file that would occupy even more disk space, I compress the database to save space. I will illustrate this below using the same database we grepped for the MD5 hash in above and the Unix command ‘gzip’.

pmedina@forensic:~/NSRL/RDS_236m$ gzip -c NSRLFile.txt > /tmp/NSRLFILE.txt.gz
pmedina@forensic:~/NSRL/RDS_236m$ gzip -l /tmp/NSRLFILE.txt.gz
compressed        uncompressed  ratio uncompressed_name
1784524236          3339032716  46.6% /tmp/NSRLFILE.txt

As you can see from the command output above, compressing the database cut its size almost in half.

Searching in a compressed file
Now that the database is compressed, we cannot use the regular ‘grep’ command to search the file. Instead we have to use a command like ‘zgrep’, which uncompresses the file on the fly before running ‘grep’.

pmedina@forensic:~/NSRL/RDS_236m$ time zgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""

real    1m4.639s
user    0m40.735s
sys     0m10.917s

This command takes about twelve seconds longer to complete than searching through the uncompressed database. In most cases, the gain in disk space is worth the increase in the time it takes to search through the file.

Optimizing decompression
The program ‘gzip’ was written almost 20 years ago, and there are now programs that are better suited to today's computer hardware. One replacement for ‘gzip’ is ‘pigz’, a parallel implementation of gzip. This program supports parallel execution in multiple threads and makes better use of the available CPU power on multicore or multiprocessor systems than the standard gzip implementation. By replacing ‘gzip’ with ‘pigz’ in the ‘zgrep’ shell script, we can reduce the time it takes to search through the compressed hash database.

pmedina@forensic:~/NSRL/RDS_236m$ diff /bin/zgrep /bin/pigzgrep
170c170
<     (gzip -cdfq -- "$i" 5>&-; echo $? >&5) 3>&- |
---
>     (pigz -cdfq "$i" 5>&-; echo $? >&5) 3>&- |
pmedina@forensic:~/NSRL/RDS_236m$
pmedina@forensic:~/NSRL/RDS_236m$ time pigzgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","91D2AB5FDB8A07239B9ABB1945A167B7","C0202BF4","sharemfcRes.dll",7800,7464,"Win2kPro",""

real    0m54.983s
user    0m29.658s
sys     0m10.085s

By switching to ‘pigz’, processing the compressed hash database only takes a few seconds longer than processing the uncompressed file. For the way I am using these hash reference databases, those extra seconds are acceptable, especially considering the disk space being saved.
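If you would rather not maintain a patched copy of ‘zgrep’, the same effect can be had by piping ‘pigz’ straight into ‘grep’:

# decompress on the fly with pigz and search the stream with grep
pigz -dc /tmp/NSRLFILE.txt.gz | grep '^"FFFFFF7C06B881D7C1F011688E996C11E1699E97",'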

Friday, July 6, 2012

Post-processing hash databases from NIST

As mentioned in my previous blog post, the National Software Reference Library (NSRL), a part of the National Institute of Standards and Technology (NIST), generates and periodically releases hash databases called Reference Data Sets (RDS). Each entry in the RDS contains information such as the MD5, SHA-1 and CRC32 checksums of a file, as well as its size and the product it belongs to. The RDS is published as downloadable .ISO images, each holding a .zip archive that contains five files:
  • NSRLFile.txt  - Main hash text file.
  • NSRLMfg.txt - Manufacturer listing.
  • NSRLOS.txt - Operating systems listing.
  • NSRLProd.txt - Product listing.
  • hashes.txt - SHA-1 hashes for the files above.


NIST also makes other files available to us, including a .zip archive called the "minimal" hashset. This hash database contains all the entries that the other RDS databases have but lists only one example of every file in the NSRL. It is this reduced hash database that I will be using in the examples below.

Products part of the database
Most forensic investigators I know and have talked to use the RDS from NIST as their ‘KnownGood’ hashset, actively discarding any file that generates a match in these databases. What NIST says on the matter is that the files used to generate the information in the RDS are neither known to be good nor known to be bad - they are simply known. Entries in the RDS are not only generated from files that are part of operating systems, but also from applications that might be unwanted or even considered malicious by some organizations. NIST therefore recommends that the forensic examiner partition the RDS so that any unwanted applications are excluded from the database.

To accomplish this we first need to process the ‘NSRLProd.txt’ file, which contains a list of all the products NIST has generated hashes for in the hash database. To make it easier to understand exactly what is included in the RDS, we create a uniquely sorted list of all the ‘ApplicationType’ fields, the last field of the comma-separated file.

pmedina@forensic:~/NSRL/RDS_236m$ head NSRLProd.txt
"ProductCode","ProductName","ProductVersion","OpSystemCode","MfgCode","Language","ApplicationType"
1,"Norton Utilities","2.0 WinNT 4.0","WINNT","SYM","English","Utility"
2,"CRT","2.4","Gen","Unknown","English","Telnet"
7,"Harvard Graphics","3.0 Upgrade","DOS","SPC","English","Presentation"
8,"ScreenShow","N/A","DOS","SPC","English","Screen Saver"
9,"Norton Utilities","8","DOS","SYM","English","Utility"
9,"Norton Utilities","8","Gen","SYM","English","Utility"
9,"Norton Utilities","8","WIN","SYM","English","Utility"
14,"FastTrackSchedule","Windows","WIN","AEC","English","Calendar"
16,"Report Writer","N/A","DOS","CLA","English","Reports"
pmedina@forensic:~/NSRL/RDS_236m$ tail -n +2 NSRLProd.txt | awk -F,\" '{print $NF}' | tr -d '"' | sort -u
3d computer graphics and design
3d computer graphics and design,architectural
3D Landscaping amd Animation
3D Landscaping amd Animation,Design Suite,Tutorial
3D Landscaping amd Animation,Graphics Suite
3D Landscaping amd Animation,Modeling Software
Accessibility
Accessories
Accessories,Configuration & Management
Accounting
..
..
X Server
X Server,X Windows
X Windows
Year 2000
zip

Looking through the list, the application types that catch my attention right away are named ‘Cryptography’, ‘Disk Wiper’, ‘employee monitoring’, ‘Encryption’, ‘File Sharing’, ‘Hacker Tool’, ‘Keyboard Logger’, ‘p2p client’, ‘password recovery’, ‘privacy tool’ and ‘Steganography’. Even if you find files that belong to these application types, it does not automatically mean that the system you are investigating has been compromised. What it does mean is that you need to look into the matter. It might just be that there is a legitimate use for having products of the application type ‘Disk Wiper’, shown below, installed on the system you are investigating.

pmedina@forensic:~/NSRL/RDS_236m$ grep -e 'Disk Wiper\"' NSRLProd.txt
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","190","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","200","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","204","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","209","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","226","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","231","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","237","731","English","Disk Wiper"
18641,"Absolute Disk Wiper","None","189","727","English","Disk Wiper"
18746,"Active@KillDisk Professional Suite 5.0","1999-2008","189","1211","English","Disk Wiper"
18748,"Active@ Eraser Professional 4.1","4.1","189","1211","English","Disk Wiper"
18750,"East-Tec Eraser 2008","c.1998-2008","189","1209","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","231","1543","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","237","1543","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","359","1543","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","190","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","194","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","231","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","237","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","190","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","231","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","237","1544","English","Disk Wiper"
                       
Partitioning the database
Now that you have found the application types you want to separate from your existing hash database, we need to divide the file ‘NSRLFile.txt’ into two new files, something NIST calls partitioning the hash database. There are many ways to go about this, but I will be using a program called ‘nsrlext.pl’, part of the ByteInvestigator toolkit by Tony Rodrigues. This program could initially only search for entries belonging to the application type ‘Hacker Tool’ and separate those entries from the rest of the database. Since I needed to do some more extensive searching and partitioning, I patched Tony's program so it can now search for more application types than just ‘Hacker Tool’. Follow the instructions below to download ‘nsrlext.pl’ and apply my patch.

pmedina@forensic:~/NSRL/RDS_236m$ wget --no-verbose http://downloads.sourceforge.net/project/byteinvestigato/byteinvestigatr/0.1.6/ByteInvestigator0.1.6.zip
2012-07-05 15:08:12 URL:http://iweb.dl.sourceforge.net/project/byteinvestigato/byteinvestigatr/0.1.6/ByteInvestigator0.1.6.zip [103132/103132] -> "ByteInvestigator0.1.6.zip" [1]
pmedina@forensic:~/NSRL/RDS_236m$ md5sum ByteInvestigator0.1.6.zip
4a9d5e3d004f95caabd4fe5ab1a70d2a  ByteInvestigator0.1.6.zip
pmedina@forensic:~/NSRL/RDS_236m$ 7z x ByteInvestigator0.1.6.zip nsrlext.pl

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,1 CPU)

Processing archive: ByteInvestigator0.1.6.zip

Extracting  nsrlext.pl

Everything is Ok

Size:       2585
Compressed: 103132
pmedina@forensic:~/NSRL/RDS_236m$ cat ../nsrlext.pl-v01.patch
12a13
> use strict;
14c15
< my $ver="0.1";
---
> my $ver="0.1m";
17,18c18,19
< %args = ( );
< getopts("hn:p:g:b:", \%args);
---
> my %args = ( );
> getopts("hn:p:g:b:s:", \%args);
21c22,26
< if ($args{h}) {
---
> my $error_msg="\n";
> unless ($args{n}){$error_msg.="Enter the NSRL hashset file list (comma delimited)\n";$args{h}="";}
> unless ($args{p}){$error_msg.="Enter the NSRL product file list (comma delimited)\n";$args{h}="";}
> 
> if (defined $args{h}) {
29a35
>  -s :string to search for. Default "Hacker Tool". Ex: -s "Web Builder|Presentation"
32a39
>    print "$error_msg\n";
36,41c43,45
< die "Enter the NSRL hashset file list (comma delimited)\n" unless ($args{n});
< die "Enter the NSRL product file list (comma delimited)\n" unless ($args{p});
< 
< die "Enter known good and/or known bad output filenames\n" unless (($args{g}) || ($args{b}));
< 
< my %hack;
---
> my (%hack,$string);
> if ($args{s}){$string=$args{s};}
> else {$string="Hacker Tool";}
47,48c51,52
< 
< foreach $item (@prod) {
---
> print "searching for products labeled: $string\n";
> foreach my $item (@prod) {
56c60,63
<       $hack{$line[0]} = $item if ($line[6] =~ /Hacker Tool/);
---
>       $line[6]=~s/\W+$//;
>       $line[6]=~s/^"//;
>       $line[6]=~s/"$//;
>       $hack{$line[0]} = $_ if ($line[6] =~m/^($string)$/i);
60a68,71
> print "excluding the following products:\n";
> foreach my $product (keys %hack){print "$hack{$product}\n";}
> 
> unless (($args{g}) || ($args{b})){print "\nPlease use '-g' and '-b' to specify the files to store the result of the hash database partition\n";exit;}
71c82,83
< foreach $item (@hset) {
---
> foreach my $item (@hset) {
> print "processing hash file: $item\n";
75a88,94
>       if ($i==0){
>               print BAD $_ if ($args{b});
>               print GOOD $_ if ($args{g});
>                       $i++;
>               next;
>       }
> 
77c96
<       print ">" if (($i % 10000) == 0);
---
>       if ($i == 10000){print STDERR ".";$i=1;}
108a128
> Modified by Par Osterberg Medina
111a132
> return ();
pmedina@forensic:~/NSRL/RDS_236m$ patch nsrlext.pl -i ../nsrlext.pl-v01.patch -o nsrlext-v01m.pl
patching file nsrlext.pl
pmedina@forensic:~/NSRL/RDS_236m$ chmod +x nsrlext-v01m.pl
pmedina@forensic:~/NSRL/RDS_236m$ ./nsrlext-v01m.pl

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina
--------------------------------------------------------------------------

 uso: nsrlext.pl -n nsrl_files_comma_separated -p nsrl_prod_files_comma_separated [-g known_good_txt] [-b known_bad_txt] [-h]

 -n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
 -p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
 -g :known good txt filename. Ex: -g good.txt
 -b :known bad txt filename. Ex: -b bad.txt
 -s :string to search for. Default "Hacker Tool". Ex: -s "Web Builder|Presentation"
 -h :help


Enter the NSRL hashset file list (comma delimited)
Enter the NSRL product file list (comma delimited)

pmedina@forensic:~/NSRL/RDS_236m$

For us to partition the hash database and actually split the ‘NSRLFile.txt’ file in two, we need to pass a couple of switches to the ‘nsrlext.pl’ program. First we use the switch ‘-n’ to give the program the path to our hash database, the ‘NSRLFile.txt’. The next step is to specify, with the switch ‘-p’, the path to ‘NSRLProd.txt’, the file that contains all the product codes for the entries in the hash database. We also need to specify which ‘ApplicationType’ strings to search for, so the program can build the list of products to separate from the hash database. This is done with the switch ‘-s’, followed by the names you want to search for. In the example below I am searching for the same application types that caught my attention above.

pmedina@forensic:~/NSRL/RDS_236m$ ./nsrlext-v01m.pl -n NSRLFile.txt -p NSRLProd.txt -s "Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography"

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina
--------------------------------------------------------------------------

searching for products labeled: Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography
excluding the following products:
19296,"Spector CNE Investigator","1998-2008","254","903","English","employee monitoring"
20722,"uTorrent","0.9.2","395","1429","English","File Sharing"
9755,"ProDiscover Basic and ZeroView","2002-2006","WIN","TPathways","English","Encryption"
21763,"iMesh","10","189","1451","English","p2p client"
6490,"Hack Attacks Denied Second Edition","NA","WIN","Wiley","English","Hacker Tool"
3297,"Guide to Hacking Software Security 2002","1.0","WINXP","Silv","English","Hacker Tool"
..
..
6510,"Invisible KeyLogger 2000","NA","UNK","Amecisco","English","Keyboard Logger"
21805,"Limewire","4.14.8","189","1448","English","p2p client"
21777,"Limewire","5.5.14","189","1448","English","p2p client"
2099,"RPing","NA","WIN","Unknown","English","Hacker Tool"
4525,"Spector CNE","4.1","WINXP","Spectorsoft","Unknown","Keyboard Logger"
2102,"PortScan","NA","WIN","Unknown","English","Hacker Tool"

Please use '-g' and '-b' to specify the files to store the result of the hash database partition
pmedina@forensic:~/NSRL/RDS_236m$

If everything looks correct and the products you want to separate are listed in the command output, the only thing left to do is to specify the paths to the files to store the result in. This is done with the switches ‘-g’ and ‘-b’. The entries that match the products we searched for will be put in the file specified with the ‘-b’ switch, and all other entries will be put in the file specified with the ‘-g’ switch. Even though ‘-g’ stands for good and ‘-b’ stands for bad, I will not put the entries that match my search criteria in the ‘KnownBad’ hash category. Instead I will put the database with the matches in the ‘KnownForbidden’ category, explained in this blog post.

pmedina@forensic:~/NSRL/RDS_236m$ sudo ./nsrlext-v01m.pl -n NSRLFile.txt -p NSRLProd.txt -s "Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography" -g /opt/forensic/database/knowngood/RDS_236m-good.txt -b /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina
--------------------------------------------------------------------------

searching for products labeled: Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography
excluding the following products:
19296,"Spector CNE Investigator","1998-2008","254","903","English","employee monitoring"
20722,"uTorrent","0.9.2","395","1429","English","File Sharing"
9755,"ProDiscover Basic and ZeroView","2002-2006","WIN","TPathways","English","Encryption"
..
..
processing hash file: NSRLFile.txt
...................... OUTPUT REMOVED
Done !
pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ wc -l /opt/forensic/database/knowngood/RDS_236m-good.txt /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt
  25804471 /opt/forensic/database/knowngood/RDS_236m-good.txt
     88455 /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt
  25892926 total
pmedina@forensic:~/NSRL/RDS_236m$

The NSRL hash database has now been separated into two hash databases - the line counts add up to one more than the original file because the header line is written to both output files. The first database is placed in our ‘KnownGood’ category, and any files that match entries in that database will be automatically discarded. The second database holds entries for files that are not allowed in our organization, generating alerts if any of these files are detected on a system we are investigating.
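Since both partitions are still in the RDS format, they can be indexed with ‘hfind’ exactly like the full database before being put to use:

# create MD5 indexes for the two partitions so they can be searched quickly
hfind -i nsrl-md5 /opt/forensic/database/knowngood/RDS_236m-good.txt
hfind -i nsrl-md5 /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt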

Thursday, July 5, 2012

Categories for hash databases



By now you have most likely started to generate some hash databases of your own using hashdog, and it is time to start putting them to use. In this blog post I will describe how I usually categorize my hash databases and how I use them. You might want to do things differently based on the type of forensic investigations you are involved in or the type of environment you are supporting.

Common hash categories
You are probably already using the Reference Data Set (RDS) from the National Software Reference Library (NSRL) as one of your hash databases. These databases have been created by NIST in a controlled environment and contain hashes from applications and operating systems, mostly generated from files still on their original media. These files are known to be good and are usually put in a hash category called ‘KnownGood’. This category contains hash databases of files that are known to be benign, files that you are not interested in investigating further.

You might also have hash databases that contain hashes of malicious files you want to search for. Those databases belong to a hash category known as ‘KnownBad’. As with the ‘KnownGood’ category, you do not really want to spend your time analyzing any files you find matches for in this category. If you get a match for a hash in any of the databases in this category, chances are high that the malware has already been analyzed by multiple organizations before you came across it. Your time is better spent figuring out how the malware got on the system you are investigating and whether other artifacts, like registry keys and temp files, are consistent with the analysis that has already been done. After all, the file could just have been placed on your system to throw you off and keep you from finding the real anomaly.

Extending the KnownGood and KnownBad categories
When I started to write hashdog and was generating hash databases of my own, it did not seem right to put some of my databases in the ‘KnownGood’ category. When I was creating databases from files downloaded directly from the vendor or extracted from a verified ISO image, I put the hash databases I generated in the ‘KnownGood’ category. However, when I was generating hashes from standard OS build images and files listed on application shares, I had no really good way of guaranteeing that the files were absolutely free from malware. After thinking a lot and discussing it with Glenn, I came up with a solution that works for the kind of forensic investigations I am mostly involved in - looking for anomalies in a system that could indicate that the system has been compromised.

Instead of using just the two hash categories mentioned above, I decided to use a third and a fourth category, calling them ‘KnownUsed’ and ‘KnownForbidden’. The ‘KnownUsed’ category contains databases of hashes generated from files that are actively being used by the organization I am investigating. Any hits against databases in this category are treated differently than matches against one of my ‘KnownGood’ databases. For instance, if a file has a hash that is part of any of the ‘KnownGood’ databases, that file will be discarded immediately without any further analysis. If I get a hit against one of the hash databases in the ‘KnownUsed’ category, the file will not be completely discarded, but I will not pay much attention to it, at least not in my initial analysis. The argument for this is that if I am looking for an anomaly, it is highly unlikely that the anomaly is a file I have previously generated a hash for and included in my ‘KnownUsed’ database.

The fourth category, which I call ‘KnownForbidden’, contains databases of hashes created from files belonging to applications that are not allowed within the organization. Common hashes to put in this category are generated from files belonging to non-corporate encryption software such as Truecrypt, penetration testing software like Metasploit, and privacy and cleaning tools like CCleaner. These applications are not malicious in themselves but could indicate malicious use if they are found on a system I am investigating. As with the ‘KnownBad’ category above, I want to be alerted if any such files are detected, but I do not want to analyze the files. To sum things up, these are the categories I use and the way I treat any matches I get against the hash databases:

  • KnownGood - Discard any files from further analysis.
  • KnownBad - Alert on any matches but do not analyze the file.
  • KnownUsed - Put these files aside for later analysis.
  • KnownForbidden - Alert on any matches but do not analyze the file.


By breaking up the hash categories this way, it is easier for me to focus on the files I have not seen before - the unknown files whose functionality is not known.
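To make the workflow concrete, here is a minimal shell sketch of how a single hash could be checked against the four categories in priority order. The database paths are hypothetical, and the sketch assumes indexed databases and the ‘Hash Not Found’ string that ‘hfind’ prints on a miss:

#!/bin/sh
# walk the categories in priority order and report the first database
# that knows the hash (the paths below are made-up examples)
HASH="$1"
for CAT in knowngood knownbad knownused knownforbidden; do
    RESULT=$(hfind "/opt/forensic/database/$CAT/db.txt" "$HASH")
    case "$RESULT" in
        ""|*"Hash Not Found"*) continue ;;
        *) echo "$CAT: $RESULT"; break ;;
    esac
done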

Friday, June 29, 2012

Using hashdog

Now that you have successfully installed hashdog, it is time to start generating some databases that we can use in our forensic investigations. Hashdog uses the option ‘--input’ to specify the file or directory to recursively generate hashes for. As I mentioned in my previous blog post, hashdog uses 7-Zip for extracting archives. The program will check every file it processes to see if it can be extracted. If it can, hashdog will extract the file and generate hashes for the content of the archive as well. It will continue to do so until there are no more files to extract in the archive and then move on to the next file in the directory being processed. To illustrate how this is done, I am going to take you through an example where we build our own hash database.


Generating your own hash database
In this example I have three files that I want to generate a database of MD5 hashes for. That is done by using the switch ‘--md5sum-file’ and specifying the path to the file where we want to store the results, our hash database. As input to hashdog I am using a couple of files I have downloaded from the Internet, namely the installer for Firefox version 13.0.1 and Red Drive from JSCAPE. I also copied the ‘ls’ binary from my Debian system to the directory that I am going to process.


pmedina@forensic:~/hashdog$ ls -l /files/
total 17452
-rw-r--r-- 1 pmedina pmedina 16577248 Jun 26 07:47 Firefox Setup 13.0.1.exe
-rwxr-xr-x 1 root    root      108008 Jun 28 09:57 ls
-rw-r--r-- 1 pmedina pmedina  1148549 Oct 24  2011 RedDrive.zip
pmedina@forensic:~/hashdog$ ./hashdog.pl --input /files --md5sum-file /tmp/example.txt
[*] hashdog.pl version: 0.72 written by Par Osterberg Medina
[-] minimum filesize to process: 1 bytes
[-] archive binary: 7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
[-] using tmp folder: /tmp/hd830002
[+] processing files recursivly from: /files
[+] RedDrive.zip
[-] extracting archive: zip
[+] RedDrive.zip/install.msi
[-] extracting archive: Compound
[+] RedDrive.zip/install.msi/!_Columns
[+] RedDrive.zip/install.msi/!EventMapping
[+] RedDrive.zip/install.msi/Binary.InstallUtil
[-] extracting archive: PE
[+] RedDrive.zip/install.msi/Binary.InstallUtil/.rsrc/VERSION/1
[+] RedDrive.zip/install.msi/Binary.InstallUtil/.data
[+] RedDrive.zip/install.msi/Binary.InstallUtil/.reloc
[+] RedDrive.zip/install.msi/Binary.InstallUtil/.text
[+] RedDrive.zip/install.msi/[5]SummaryInformation
[+] RedDrive.zip/install.msi/!Media
[+] RedDrive.zip/install.msi/Binary._70B6BD6470D90F593F71019EF5DC9D42
..
..


The complete output of the command above is too large to include here, but I will look at the beginning of the command execution and explain what happens. The first file that we process and generate a checksum for is the ‘RedDrive.zip’ file itself. That file is identified by 7-Zip as a zip archive and is therefore extracted. The zip archive contains one file, called ‘install.msi’, that is also identified as an archive. After calculating the MD5 checksum for the ‘install.msi’ file, the content of the .msi archive is also extracted. Within the .msi archive there are a bunch of files that hashdog generates checksums for and extracts where possible.


Archive types to skip
As you can see in the command output above, 7-Zip will also try to extract the resource sections contained in a Portable Executable (PE) file, something we might not always want to do. The same behavior is also observed for the Linux binary ‘ls’ that I previously copied to the directory we are processing.


[+] ls
[-] extracting archive: ELF
[+] ls/4
[+] ls/6
[+] ls/5
[+] ls/0
[+] ls/2
[+] ls/3
[+] ls/1


In most cases we do not want to extract the content of ELF binaries and PE files. By using the switch ‘--archive-skip’ we can specify which archive types to exclude from extraction. The option takes a case-insensitive, comma-separated list of archive types that should not be processed; in our case we want to specify PE and ELF.


pmedina@forensic:~/hashdog$ wc -l /tmp/example.txt
2644 /tmp/example.txt
pmedina@forensic:~/hashdog$ ./hashdog.pl --input /files --md5sum-file /tmp/example.txt --archive-skip=PE,ELF
[*] hashdog.pl version: 0.72 written by Par Osterberg Medina
[-] minimum filesize to process: 1 bytes
[-] archive binary: 7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
[-] using tmp folder: /tmp/hd950156
[+] processing files recursivly from: /files
[+] RedDrive.zip
[-] extracting archive: zip
[+] RedDrive.zip/install.msi
[-] extracting archive: Compound
..
..
[-] extracting archive: PENsis
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_1
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_6
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_2
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x01/\x01\x1a/InstallOptions.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x01/\x01\x1a/modern-header.bmp
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x01/\x01\x1a/ServicesHelper.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x01/\x01\x1a/ioSpecial.ini
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x01/\x01\x1a/modern-wizard.bmp
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_5
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x15/\x01\x1a/nsExec.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x15/\x01\x1a/Banner.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_3
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_4
[+] done, finished in: 0 hours, 0 minutes and 16 seconds
[-] deleting the tmp directory
pmedina@forensic:~/hashdog$ wc -l /tmp/example.txt
2212 /tmp/example.txt


Our database now has 432 fewer entries than it did before we added the ‘--archive-skip’ option.

Full file path in the hashset
Opening the file we created and looking at the result of the command above, we see that the file has two columns: the MD5 checksum and the name of the file.


pmedina@forensic:~/hashdog$ head /tmp/example.txt
2e54d3fb64ff68607c38ecb482f5fa25  RedDrive.zip
70e5118a1d0cff1a59b820a919b15949  install.msi
a6ae9f9e02477526bbac1e97357141be  !_Columns
a76026797c61d04c1c9990366e48208e  !EventMapping
4b6f4f52de80f1a7890c9bd0a7cac5e3  Binary.InstallUtil
45d11bc27761a502bce036adbcf64f7d  [5]SummaryInformation
d2dd55a6b2d6d768ab6254c169d41ce9  !Media
8806ebee0e08ab6338d0fdc87be83fc4  Binary._70B6BD6470D90F593F71019EF5DC9D42
d6b3635d8e144efae4ab1753695c19af  !Dialog
8e097d7e4ebf2f6cde863fbd7de296e6  !Feature


As you can see above, the file name in the hashset is not the full path to the file we generated the checksum for. Having the full path to the file in the hash database will increase the size of the file a lot and, as in the case with installers, the path to your file will almost never match the path to the file you have on disk. There are, however, some cases where you want to include the full path to the file in your database. This can be accomplished by specifying the ‘--md5sum-fullpath’ switch.


pmedina@forensic:~/hashdog$ wc -c /tmp/example.txt
113263 /tmp/example.txt
pmedina@forensic:~/hashdog$ ./hashdog.pl --input /files --md5sum-file /tmp/example.txt --archive-skip=PE,ELF --md5sum-fullpath
[*] hashdog.pl version: 0.72 written by Par Osterberg Medina
[-] minimum filesize to process: 1 bytes
[-] archive binary: 7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
[-] using tmp folder: /tmp/hd950156
[+] processing files recursivly from: /files
[+] RedDrive.zip
[-] extracting archive: zip
..
..
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x15/\x01\x1a/Banner.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_3
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_4
[+] done, finished in: 0 hours, 0 minutes and 15 seconds
[-] deleting the tmp directory 
pmedina@forensic:~/hashdog$ wc -c /tmp/example.txt
271032 /tmp/example.txt
pmedina@forensic:~/hashdog$ head /tmp/example.txt
2e54d3fb64ff68607c38ecb482f5fa25  RedDrive.zip
70e5118a1d0cff1a59b820a919b15949  RedDrive.zip/install.msi
a6ae9f9e02477526bbac1e97357141be  RedDrive.zip/install.msi/!_Columns
a76026797c61d04c1c9990366e48208e  RedDrive.zip/install.msi/!EventMapping
4b6f4f52de80f1a7890c9bd0a7cac5e3  RedDrive.zip/install.msi/Binary.InstallUtil
45d11bc27761a502bce036adbcf64f7d  RedDrive.zip/install.msi/[5]SummaryInformation
d2dd55a6b2d6d768ab6254c169d41ce9  RedDrive.zip/install.msi/!Media
8806ebee0e08ab6338d0fdc87be83fc4  RedDrive.zip/install.msi/Binary._70B6BD6470D90F593F71019EF5DC9D42
d6b3635d8e144efae4ab1753695c19af  RedDrive.zip/install.msi/!Dialog
8e097d7e4ebf2f6cde863fbd7de296e6  RedDrive.zip/install.msi/!Feature


The size of our file increased quite a lot when we started to include the full path in our database. It got 157769 bytes larger, well over twice its original size. The question you have to ask yourself when generating your hash databases is: does having the full path in the file justify the increase in file size?


Generating your own RDS file
Even though hashdog can be used to generate hash databases in a lot of different formats, the format that will probably be of most use to us as forensic examiners and incident responders is the RDS format. This is the data format that NIST uses in the NSRL Reference Data Set (RDS), and it can be imported into a lot of tools. For a deeper dive into the format, I suggest reading the PDF “Data Formats of the NSRL Reference Data Set (RDS) Distribution”: http://www.nsrl.nist.gov/Documents/Data-Formats-of-the-NSRL-Reference-Data-Set-16.pdf


As in the previous example, all we need to do to generate a hash database in the RDS format is to specify the switch ‘--rds-file’ followed by the path to the file we want to store the result in. If we want to, we can also store the full path to the files in the database by using the switch ‘--rds-fullpath’.


pmedina@forensic:~/hashdog$ ./hashdog.pl --input /files --md5sum-file /tmp/example.txt --archive-skip=PE,ELF --md5sum-fullpath --rds-file /tmp/example-rds.txt
[*] hashdog.pl version: 0.72 written by Par Osterberg Medina
[-] minimum filesize to process: 1 bytes
[-] archive binary: 7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
[-] using tmp folder: /tmp/hd950156
[+] processing files recursivly from: /files
[+] RedDrive.zip
[-] extracting archive: zip
..
..
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_3
[+] Firefox Setup 13.0.1.exe/setup.exe/\x01_4
[+] done, finished in: 0 hours, 0 minutes and 16 seconds
[-] deleting the tmp directory
pmedina@forensic:~/hashdog$ wc -l /tmp/example.txt /tmp/example-rds.txt
  2212 /tmp/example.txt
  2213 /tmp/example-rds.txt
  4425 total
pmedina@forensic:~/hashdog$ wc -c /tmp/example.txt /tmp/example-rds.txt
271032 /tmp/example.txt
274652 /tmp/example-rds.txt
545684 total
pmedina@forensic:~/hashdog$ head /tmp/example-rds.txt
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"d9c40dd2f1fb08927e773a0dc70d75fedd71549e","2e54d3fb64ff68607c38ecb482f5fa25","732c6df0","RedDrive.zip",1148549,0,"WIN",""
"d1fa60a19ca5095731eb78cd6f6c7e3eca2cf57c","70e5118a1d0cff1a59b820a919b15949","c224e555","install.msi",1413120,0,"WIN",""
"c9220e529ad1ca74d6d7b4a8a17529e326f617cf","a6ae9f9e02477526bbac1e97357141be","8b1ddae5","!_Columns",3328,0,"WIN",""
"86d90e616458eea7188605b9f601c6bb7b46aeaf","a76026797c61d04c1c9990366e48208e","3a0a30ce","!EventMapping",208,0,"WIN",""
"e45efe29240c68452730fc32327eb3048a162e2d","4b6f4f52de80f1a7890c9bd0a7cac5e3","fa0f9bcc","Binary.InstallUtil",55296,0,"WIN",""
"e397c2f9d0ec530a8453193aa5b15eb40c48e00f","45d11bc27761a502bce036adbcf64f7d","77ec3d28","[5]SummaryInformation",412,0,"WIN",""
"36da55c877afeb7af220c80b7021368985a1d24e","d2dd55a6b2d6d768ab6254c169d41ce9","0c1fb7c4","!Media",12,0,"WIN",""
"3690c3ca05b6ed8ef86589dc93495811b1fb49db","8806ebee0e08ab6338d0fdc87be83fc4","b9c0a320","Binary._70B6BD6470D90F593F71019EF5DC9D42",12192,0,"WIN",""
"3ccfc07833e6025ffdc0b110c45c2e2c97588efc","d6b3635d8e144efae4ab1753695c19af","abc7c53e","!Dialog",594,0,"WIN",""
pmedina@forensic:~/hashdog$


In the example above, we used the switches ‘--md5sum-file’ and ‘--rds-file’ to generate both a file with MD5 checksums and a file in the RDS format at the same time. Currently, hashdog can generate hashsets with MD5 checksums, SHA-1 checksums and in the RDS format, but support for more formats might be added in the future.
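As a closing example, the switches demonstrated in this post can be combined into a single run, producing both hashsets in one pass over the input:

# one pass over /files producing an MD5 hashset and an RDS hashset,
# both with full paths, while skipping PE and ELF extraction
./hashdog.pl --input /files \
    --md5sum-file /tmp/example.txt --md5sum-fullpath \
    --rds-file /tmp/example-rds.txt --rds-fullpath \
    --archive-skip=PE,ELF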