Tuesday, July 10, 2012

Searching Hash Databases

I have previously shown how you can generate your own hash databases and how to download and use the NSRL databases from NIST. In this blog post I will describe how you can use these hash databases most effectively to search for hashes, and the steps involved in getting there.

Removing duplicate entries
If you have generated hashes from files located on disk cloning images or from shares with application installers, chances are very high that your databases have entries for files that are named differently but have the same hash. I always try to keep the databases I generate as small as possible and free from duplicate entries of identical files. To show how we can remove these duplicate entries I will use the Unix commands ‘sort’ and ‘uniq’ on one of the databases I generated in a previous blog post.

pmedina@forensic:~$ head -5 /tmp/example-rds.txt
pmedina@forensic:~$ head -1 /tmp/example-rds.txt > /tmp/example-rds-unique.txt
pmedina@forensic:~$ sort -n -k1,1 /tmp/example-rds.txt | uniq -w 42 >> /tmp/example-rds-unique.txt
pmedina@forensic:~$ wc -l /tmp/example-rds*
  2213 /tmp/example-rds.txt
  1876 /tmp/example-rds-unique.txt
  4089 total

As you can see above, the hash database I removed the duplicate entries from uses the RDS format, so the commands above will only work on a hash database in that format: ‘uniq -w 42’ compares only the first 42 characters of each line, which is the 40-character SHA-1 checksum plus its two surrounding quotes. By sorting the file and only keeping the lines that had a unique SHA-1 checksum, I managed to get rid of 337 entries in the database. This might not seem like a big gain at first but keep in mind that this is a very small database that was generated from just a few files. Having databases with unique entries will save you disk space and even speed up your searching.
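
As a variation on the commands above, here is a minimal sketch that sorts only the data lines, so the header ends up in the result exactly once. The sample rows, file names and temporary directory are made up for illustration, and ‘sort -u’ on the first comma-separated field is used in place of ‘uniq -w 42’; treat it as one possible approach, not the only one.

```shell
#!/bin/sh
# Sketch: de-duplicate an RDS-style file on its first field (the quoted SHA-1)
# while keeping the header line once. All sample data below is made up.
set -eu
workdir=$(mktemp -d)
db="$workdir/sample-rds.txt"
unique="$workdir/sample-rds-unique.txt"

cat > "$db" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB","22222222222222222222222222222222","00000002","b.dll",20,1,"WIN",""
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","11111111111111111111111111111111","00000001","a.dll",10,1,"WIN",""
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","11111111111111111111111111111111","00000001","copy-of-a.dll",10,1,"WIN",""
EOF

# Copy the header first, then sort everything after it and keep a single
# line for each distinct value of field 1.
head -n 1 "$db" > "$unique"
tail -n +2 "$db" | LC_ALL=C sort -t, -k1,1 -u >> "$unique"

# Number of lines left: the header plus one line per unique SHA-1.
grep -c . "$unique"
```

Setting ‘LC_ALL=C’ makes ‘sort’ compare raw bytes, which is enough for hexadecimal checksums and avoids locale-dependent ordering.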

Searching for a hash
Our hash databases are in essence plain ASCII text files, and the classic way to search a text file is with the Unix command ‘grep’. This command is very versatile and can be used in many different ways, including searching for hashes in a hash database. In the ‘grep’ command below I am searching for an MD5 checksum in the file ‘NSRLFile.txt’ from the “minimal” hashset, one of the reduced RDS files I have downloaded from NIST.

pmedina@forensic:~/NSRL/RDS_236m$ time grep ",\"91D2AB5FDB8A07239B9ABB1945A167B7\"," NSRLFile.txt

real    0m52.153s
user    0m0.332s
sys     0m5.144s

The hash I was looking for is in the last entry in the database and it took 52 seconds on my system for ‘grep’ to find this row. The command ‘grep’ does a linear search using the Boyer–Moore string search algorithm and is in general a very efficient way to search for data in a file. For our needs, however, we have to try to speed up the search.
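
Before moving on to an index, a small sketch of how the same linear search could be written with a fixed-string pattern that stops at the first hit. The ‘-F’ and ‘-m’ options are available in GNU and BSD grep (‘-m’ is not part of POSIX), and the sample row is made up apart from the MD5 checksum and file name taken from this post. This does not change the linear nature of the search; it only avoids reading on after a match.

```shell
#!/bin/sh
# Sketch: literal (non-regex) search that stops after the first matching line.
# The sample database row is made up for illustration.
set -eu
workdir=$(mktemp -d)
db="$workdir/sample-rds.txt"

cat > "$db" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"0000000000000000000000000000000000000000","91D2AB5FDB8A07239B9ABB1945A167B7","00000000","sharemfcRes.dll",100,1,"WIN",""
EOF

md5="91D2AB5FDB8A07239B9ABB1945A167B7"

# -F: treat the pattern as a literal string; -m 1: stop at the first match.
match=$(LC_ALL=C grep -F -m 1 ",\"$md5\"," "$db")
printf '%s\n' "$match"
```

When the entry sits at the very end of the file, as in the timing above, stopping early gains nothing, which is one more reason to look at an index.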

Indexing a hash database
A more efficient way to search for a hash in a large database is to create an index file and use a binary search algorithm to find the hash we are looking for. The program ‘hfind’ that is included in The Sleuth Kit by Brian Carrier does exactly that. It lets us create an index file of the hash database, which the program will later use to perform a binary search for the hash.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind -i nsrl-md5 NSRLFile.txt
Index Created

real    11m57.155s
user    2m56.219s
sys     0m47.571s
pmedina@forensic:~/NSRL/RDS_236m$ ls NSRLFile.txt*
NSRLFile.txt  NSRLFile.txt-md5.idx

Creating the index will take some time, but after ‘hfind’ has finished processing the database the index file can be found in the same directory as the hash database you created the index for. The index file is named the same as your hash database, with the hash algorithm it uses and the file extension ‘.idx’ appended. In our case the file is named ‘NSRLFile.txt-md5.idx’.

Content of the index file
The content of the index file is just the hash for the algorithm we are using, followed by an offset to where in the hash database more information can be retrieved. The hashes are sorted so that the binary search algorithm will work.

pmedina@forensic:~/NSRL/RDS_236m$ head NSRLFile.txt-md5.idx

Looking at the index file and searching for the same hash we searched for above, we see that the entry in the hash database that holds more information regarding the file is located at offset 3339032584.

pmedina@forensic:~/NSRL/RDS_236m$ grep 91D2AB5FDB8A07239B9ABB1945A167B7 NSRLFile.txt-md5.idx

Skipping to that offset in our hash database will take us exactly to the entry in the database that holds the information regarding the file we are looking for.

pmedina@forensic:~/NSRL/RDS_236m$ tail -c +0000003339032584 NSRLFile.txt
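
To make the idea of a hash followed by an offset more concrete, here is a minimal sketch that builds a similar index with ‘awk’ and then jumps to an entry with ‘tail -c’. It is not the code ‘hfind’ uses; the ‘MD5|offset’ layout, the sample rows and the file names are assumptions made for illustration, and the lookup here is a plain linear scan of the small index, not a binary search.

```shell
#!/bin/sh
# Sketch of a hash-to-offset index. Not hfind's implementation; the index
# layout and the sample rows are assumptions made for illustration.
set -eu
workdir=$(mktemp -d)
db="$workdir/sample-rds.txt"
idx="$db-md5.idx"

cat > "$db" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","2222222222222222222222222222222F","00000002","b.dll",20,1,"WIN",""
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB","1111111111111111111111111111111F","00000001","a.dll",10,1,"WIN",""
EOF

# For every data line, emit "MD5|offset", where offset is the number of bytes
# that precede the line in the database. Sort so the hashes are in order.
LC_ALL=C awk -F, '
    NR > 1 { md5 = $2; gsub(/"/, "", md5); printf "%s|%016d\n", md5, offset }
    { offset += length($0) + 1 }    # +1 for the newline
' "$db" | LC_ALL=C sort > "$idx"

# Find the offset for one hash (index() forces a string comparison).
md5="1111111111111111111111111111111F"
offset=$(awk -F'|' -v h="$md5" 'index($0, h "|") == 1 { print $2 + 0 }' "$idx")

# tail -c counts bytes from 1, so add 1 to the 0-based offset.
entry=$(tail -c "+$((offset + 1))" "$db" | head -n 1)
printf '%s\n' "$entry"
```

Because the index lines are sorted and all have the same length, a program can halve the search range on every step, which is what makes the binary search mentioned above possible.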


Size of the index file
The size of the index file will depend on the format of your hash database and the hash algorithm you use to create the index. In our case we created an MD5 index of a hash database that uses the RDS format. To calculate the size of the index file, take the number of lines in the hash database, subtract 1 for the line that contains the header, and multiply that value by the number of characters each index line holds. In our case that number is 50: 32 characters for the MD5 hash, one separator character, the 16-character offset value and a line separator.

pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ echo "25892925 -1; . * 50; . + 47" | bc
pmedina@forensic:~/NSRL/RDS_236m$ wc -c NSRLFile.txt-md5.idx
1294646247 NSRLFile.txt-md5.idx

The size will be different if you are using a SHA-1 hash as your index or another database format than the RDS.
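
The calculation can also be re-checked with plain shell arithmetic, as in the sketch below. The line count and the constant 47 are taken from the commands above; the post does not say what the extra 47 bytes are, so reading them as a header line at the top of the index is my assumption.

```shell
#!/bin/sh
# Worked arithmetic for the index size, using the numbers shown above.
lines=25892925                     # wc -l NSRLFile.txt
entry_len=$((32 + 1 + 16 + 1))     # MD5 + separator + offset + line separator
extra=47                           # constant from the bc command (assumed: index header)

size=$(( (lines - 1) * entry_len + extra ))
echo "$size"                       # prints 1294646247, matching wc -c on the index
```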

Using the index
Searching through the hash database with the use of the index file is a lot faster than using the ‘grep’ command demonstrated above. I will again search for the same MD5 hash as I searched for before, this time using ‘hfind’ and the index I created.

pmedina@forensic:~/NSRL/RDS_236m$ time hfind NSRLFile.txt 91d2ab5FDB8A07239B9ABB1945A167b7
91d2ab5FDB8A07239B9ABB1945A167b7        sharemfcRes.dll

real    0m0.051s
user    0m0.000s
sys     0m0.004s

The search completed in less than one second and printed the name of the file that is associated with the hash. If you look carefully at the command above you will see that I mixed upper and lower case, which ‘hfind’ handles perfectly. Under most circumstances, the fact that the hash matched an entry in the database is all you need to know to either discard the file or alert on its presence. There are, however, situations where I would like to know more, like the full path to the file.

Using a hash reference database
The way I usually go about generating databases with hashes I want to search for is to generate one hash database in the RDS format as well as another one with just the MD5 hashes of the files. The hash database that uses the RDS format only includes the filename, while the MD5 hash database has the full path to the file that was processed. In most computer forensic investigations, having the full path to the file that is being matched is not that useful - the exact location of the file you are investigating will most likely not match the one you have in your hash database. Sometimes, however, I have found that there is a need to know the exact path to the file that produced the hash. This is the reason I also generate a hash database that has the full path to the files, the hash reference database.
In a previous blog post I used hashdog to generate two hash databases from the same set of files. The first database I generated uses the RDS format and is the same database that we removed the duplicates from above. The second one is an MD5 hash database that was generated with the option ‘--md5-fullpath’ enabled. This is my hash reference database that includes the full path to the files that were processed. After creating an index of the hash database in the RDS format, I use this database to search for the MD5 hash of a file I have on disk.

pmedina@forensic:~$ hfind /tmp/example-rds-unique.txt ae182dc797cd9ad2c025066692fc041b
ae182dc797cd9ad2c025066692fc041b        System.dll

As you can see, I get a match for my hash and the name of the file is returned. If we want to know exactly where the checksum was generated from, we can simply grep for the same checksum in our reference database.

pmedina@forensic:~$ grep ae182dc797cd9ad2c025066692fc041b /tmp/example.txt
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/uninstall/helper.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/core/maintenanceservice_installer.exe/\x01\x1a/System.dll
ae182dc797cd9ad2c025066692fc041b  Firefox Setup 13.0.1.exe/setup.exe/\x01\x1a/System.dll

The location of the file that produced the hash is presented to us and by looking at the result it is clear that the file in question is part of the Firefox application. If you want you can also remove all the duplicates from the reference database as we previously did for the RDS database above. Personally, I like to keep the hash reference database in its original format, including duplicate entries of the same file. This is the only reference I have if I ever want to trace back the origin of the file I am analyzing.

Compressing a hash database
Using an indexed hash database is a very fast and efficient way of searching through a large database for a hash value. However, something to consider when indexing a database is size. On top of having a large database that contains the hashes and other information about the file, the index file itself will also add to the disk space that is being used.
The largest hash database I have on disk is my hash reference database - the database that includes the full path to the files. This is a database I do not frequently use so instead of creating an index file that will occupy even more disk space, I will actually compress this database to save disk space. I will illustrate this below by using the same database we used to ‘grep’ the MD5 hash in above and the Unix command ‘gzip’.

pmedina@forensic:~/NSRL/RDS_236m$ gzip -c NSRLFile.txt > /tmp/NSRLFILE.txt.gz
pmedina@forensic:~/NSRL/RDS_236m$ gzip -l /tmp/NSRLFILE.txt.gz
         compressed        uncompressed  ratio uncompressed_name
         1784524236          3339032716  46.6% /tmp/NSRLFILE.txt

As you can see from the command output above, by compressing the database we managed to reduce its size to almost half its original size.

Searching in a compressed file
Now that the database is compressed we cannot use the regular ‘grep’ command to search the file. Instead we have to use a command like ‘zgrep’ that will uncompress the file on the fly before using ‘grep’.

pmedina@forensic:~/NSRL/RDS_236m$ time zgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz

real    1m4.639s
user    0m40.735s
sys     0m10.917s

This command takes a few seconds more to complete than searching through the uncompressed database. In most cases, the gain you get in disk space is worth the increase in processing time, the time it takes to search through the file.

Optimizing decompression
The program ‘gzip’ was written almost 20 years ago, and today there are other programs that are better suited to current computer hardware. One replacement for ‘gzip’ we can use is ‘pigz’, a parallel implementation of gzip. This program supports parallel execution in multiple threads and utilizes more of the available CPU power on multicore or multiprocessor systems than the standard gzip implementation. By replacing ‘gzip’ with ‘pigz’ in a copy of the ‘zgrep’ shell script we can reduce the time it takes to search through the compressed hash database.

pmedina@forensic:~/NSRL/RDS_236m$ diff /bin/zgrep /bin/pigzgrep
<     (gzip -cdfq -- "$i" 5>&-; echo $? >&5) 3>&- |
>     (pigz -cdfq "$i" 5>&-; echo $? >&5) 3>&- |
pmedina@forensic:~/NSRL/RDS_236m$ time pigzgrep "^\"FFFFFF7C06B881D7C1F011688E996C11E1699E97\"," /tmp/NSRLFILE.txt.gz

real    0m54.983s
user    0m29.658s
sys     0m10.085s

By switching to ‘pigz’ the time to process our compressed hash database only increased by a few seconds compared to processing the uncompressed file. For the way I am using these hash reference databases, these extra seconds are acceptable, especially considering the disk space that is being saved.
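
If you would rather not keep a patched copy of ‘zgrep’ around, the same effect can be sketched with a small shell function that decompresses to a pipe and hands the stream to ‘grep’. The function name ‘zsearch’ and the sample row are made up for illustration (only the SHA-1 checksum is taken from the search above), and the fallback to ‘gzip’ when ‘pigz’ is not installed is my own addition.

```shell
#!/bin/sh
# Sketch: search a gzip-compressed hash database, preferring pigz if present.
# zsearch is a hypothetical helper name; the sample row is made up.
set -eu

if command -v pigz >/dev/null 2>&1; then
    decompressor=pigz
else
    decompressor=gzip
fi

# zsearch PATTERN FILE.gz: decompress to stdout and grep the stream.
zsearch() {
    "$decompressor" -cd < "$2" | LC_ALL=C grep -- "$1"
}

workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"FFFFFF7C06B881D7C1F011688E996C11E1699E97","00000000000000000000000000000000","00000000","example.dll",100,1,"WIN",""
EOF
gzip -c "$workdir/sample.txt" > "$workdir/sample.txt.gz"

result=$(zsearch '^"FFFFFF7C06B881D7C1F011688E996C11E1699E97",' "$workdir/sample.txt.gz")
printf '%s\n' "$result"
```

Reading the compressed file from standard input keeps the function independent of how each program parses file name arguments.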

Friday, July 6, 2012

Post-processing hash databases from NIST

As mentioned in my previous blog post, the National Software Reference Library (NSRL), a part of the National Institute of Standards and Technology (NIST), generates and periodically releases hash databases called Reference Data Sets (RDS). Each entry in the RDS contains information such as the MD5, SHA-1 and CRC32 checksums of the file as well as its size and which product it belongs to. The RDS are published as downloadable .ISO images, each holding a .zip archive that contains five files:
  • NSRLFile.txt  - Main hash text file.
  • NSRLMfg.txt - Manufacturer listing.
  • NSRLOS.txt - Operating systems listing.
  • NSRLProd.txt - Product listing.
  • hashes.txt - SHA-1 hashes for the files above.

NIST also makes other files available to us, including a .zip archive called the "minimal" hashset. This hash database covers every file that the other RDS databases have but lists only one example of each file in the NSRL. It is this reduced hash database that I will be using in the examples below.

Products part of the database
Most forensic investigators that I know or have talked to use the RDS from NIST as their ‘KnownGood’ hashset, actively discarding any file that generates a match in these databases. What NIST says on this matter is that the files used to generate the RDS are neither known to be good nor known to be bad, just known. Entries in the RDS are generated not only from files that are part of operating systems, but also from applications that might be unwanted or even considered malicious by some organizations. NIST recommends that the forensic examiner partition the RDS file so that any unwanted applications are excluded from the database.

To accomplish this we first need to process the ‘NSRLProd.txt’ file, a file that contains a list of all the products that NIST has generated hashes for in the hash database. To make it easier for us to understand exactly what is included in the RDS we need to create a uniquely sorted list of all the ‘ApplicationType’ fields, the last field of the comma separated file.

pmedina@forensic:~/NSRL/RDS_236m$ head NSRLProd.txt
1,"Norton Utilities","2.0 WinNT 4.0","WINNT","SYM","English","Utility"
7,"Harvard Graphics","3.0 Upgrade","DOS","SPC","English","Presentation"
8,"ScreenShow","N/A","DOS","SPC","English","Screen Saver"
9,"Norton Utilities","8","DOS","SYM","English","Utility"
9,"Norton Utilities","8","Gen","SYM","English","Utility"
9,"Norton Utilities","8","WIN","SYM","English","Utility"
16,"Report Writer","N/A","DOS","CLA","English","Reports"
pmedina@forensic:~/NSRL/RDS_236m$ tail -n +2 NSRLProd.txt | awk -F,\" '{print $NF}' | tr -d '"' | sort -u
3d computer graphics and design
3d computer graphics and design,architectural
3D Landscaping amd Animation
3D Landscaping amd Animation,Design Suite,Tutorial
3D Landscaping amd Animation,Graphics Suite
3D Landscaping amd Animation,Modeling Software
Accessories,Configuration & Management
X Server
X Server,X Windows
X Windows
Year 2000

Looking through the list, the application types that catch my attention right away are ‘Cryptography’, ‘Disk Wiper’, ‘employee monitoring’, ‘Encryption’, ‘File Sharing’, ‘Hacker Tool’, ‘Keyboard Logger’, ‘p2p client’, ‘password recovery’, ‘privacy tool’ and ‘Steganography’. Finding files that are part of these application types does not automatically mean that the system you are investigating has been compromised. What it does mean is that you need to look into the matter. There might well be a legitimate use for having products of the application type ‘Disk Wiper’, shown below, installed on the system you are investigating.

pmedina@forensic:~/NSRL/RDS_236m$ grep -e 'Disk Wiper\"' NSRLProd.txt
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","190","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","200","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","204","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","209","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","226","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","231","731","English","Disk Wiper"
18640,"Paragon Disk Wiper 8.5 Personal Edition","2007","237","731","English","Disk Wiper"
18641,"Absolute Disk Wiper","None","189","727","English","Disk Wiper"
18746,"Active@KillDisk Professional Suite 5.0","1999-2008","189","1211","English","Disk Wiper"
18748,"Active@ Eraser Professional 4.1","4.1","189","1211","English","Disk Wiper"
18750,"East-Tec Eraser 2008","c.1998-2008","189","1209","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","231","1543","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","237","1543","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","359","1543","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","190","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","194","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","231","1544","English","Disk Wiper"
23309,"DISKExtinguisher","c. 2009","237","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","190","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","231","1544","English","Disk Wiper"
23310,"File Extinguisher","c. 2011","237","1544","English","Disk Wiper"
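
As the output above shows, the same product code can be listed several times, once per operating system code. To get a quick overview of how many distinct products each application type holds, something like the following sketch could be used. The header line in the sample is what I assume ‘NSRLProd.txt’ starts with, and the rows are copied from the output above.

```shell
#!/bin/sh
# Sketch: count distinct product codes per ApplicationType.
# The header line is an assumption; the rows are copied from this post.
set -eu
workdir=$(mktemp -d)
prod="$workdir/NSRLProd-sample.txt"

cat > "$prod" <<'EOF'
"ProductCode","ProductName","ProductVersion","OpSystemCode","MfgCode","Language","ApplicationType"
18641,"Absolute Disk Wiper","None","189","727","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","231","1543","English","Disk Wiper"
22369,"Wipedrive Six","c. 2010","237","1543","English","Disk Wiper"
1,"Norton Utilities","2.0 WinNT 4.0","WINNT","SYM","English","Utility"
EOF

# Split on the two-character sequence ," so that commas inside the last
# field do not break it up; field 1 is then the product code.
counts=$(tail -n +2 "$prod" |
    awk -F',"' '{
            type = $NF; sub(/"$/, "", type)
            if (!seen[type SUBSEP $1]++) n[type]++
        }
        END { for (t in n) printf "%d\t%s\n", n[t], t }' |
    LC_ALL=C sort -k2)
printf '%s\n' "$counts"
```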

Partitioning the database
Now that you have found the application types that you want to separate from your existing hash database, we need to divide the file ‘NSRLFile.txt’ into two new files, something NIST calls partitioning the hash database. There are many ways to do this, but I will be using a program called ‘nsrlext.pl’, part of the ByteInvestigator toolkit by Tony Rodrigues. This program could initially only search for entries belonging to the application type ‘Hacker Tool’ and separate these entries from the rest of the database. Since I needed to do some more extensive searching and partitioning, I patched Tony’s program so it can now search for more application types than just ‘Hacker Tool’. Follow the steps below to download ‘nsrlext.pl’ and apply my patch.

pmedina@forensic:~/NSRL/RDS_236m$ wget --no-verbose http://downloads.sourceforge.net/project/byteinvestigato/byteinvestigatr/0.1.6/ByteInvestigator0.1.6.zip
2012-07-05 15:08:12 URL:http://iweb.dl.sourceforge.net/project/byteinvestigato/byteinvestigatr/0.1.6/ByteInvestigator0.1.6.zip [103132/103132] -> "ByteInvestigator0.1.6.zip" [1]
pmedina@forensic:~/NSRL/RDS_236m$ md5sum ByteInvestigator0.1.6.zip
4a9d5e3d004f95caabd4fe5ab1a70d2a  ByteInvestigator0.1.6.zip
pmedina@forensic:~/NSRL/RDS_236m$ 7z x ByteInvestigator0.1.6.zip nsrlext.pl

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,1 CPU)

Processing archive: ByteInvestigator0.1.6.zip

Extracting  nsrlext.pl

Everything is Ok

Size:       2585
Compressed: 103132
pmedina@forensic:~/NSRL/RDS_236m$ cat ../nsrlext.pl-v01.patch
> use strict;
< my $ver="0.1";
> my $ver="0.1m";
< %args = ( );
< getopts("hn:p:g:b:", \%args);
> my %args = ( );
> getopts("hn:p:g:b:s:", \%args);
< if ($args{h}) {
> my $error_msg="\n";
> unless ($args{n}){$error_msg.="Enter the NSRL hashset file list (comma delimited)\n";$args{h}="";}
> unless ($args{p}){$error_msg.="Enter the NSRL product file list (comma delimited)\n";$args{h}="";}
> if (defined $args{h}) {
>  -s :string to search for. Default "Hacker Tool". Ex: -s "Web Builder|Presentation"
>    print "$error_msg\n";
< die "Enter the NSRL hashset file list (comma delimited)\n" unless ($args{n});
< die "Enter the NSRL product file list (comma delimited)\n" unless ($args{p});
< die "Enter known good and/or known bad output filenames\n" unless (($args{g}) || ($args{b}));
< my %hack;
> my (%hack,$string);
> if ($args{s}){$string=$args{s};}
> else {$string="Hacker Tool";}
< foreach $item (@prod) {
> print "searching for products labeled: $string\n";
> foreach my $item (@prod) {
<       $hack{$line[0]} = $item if ($line[6] =~ /Hacker Tool/);
>       $line[6]=~s/\W+$//;
>       $line[6]=~s/^"//;
>       $line[6]=~s/"$//;
>       $hack{$line[0]} = $_ if ($line[6] =~m/^($string)$/i);
> print "excluding the following products:\n";
> foreach my $product (keys %hack){print "$hack{$product}\n";}
> unless (($args{g}) || ($args{b})){print "\nPlease use '-g' and '-b' to specify the files to store the result of the hash database partition\n";exit;}
< foreach $item (@hset) {
> foreach my $item (@hset) {
> print "processing hash file: $item\n";
>       if ($i==0){
>               print BAD $_ if ($args{b});
>               print GOOD $_ if ($args{g});
>                       $i++;
>               next;
>       }
<       print ">" if (($i % 10000) == 0);
>       if ($i == 10000){print STDERR ".";$i=1;}
> Modified by Par Osterberg Medina
> return ();
pmedina@forensic:~/NSRL/RDS_236m$ patch nsrlext.pl -i ../nsrlext.pl-v01.patch -o nsrlext-v01m.pl
patching file nsrlext.pl
pmedina@forensic:~/NSRL/RDS_236m$ chmod +x nsrlext-v01m.pl
pmedina@forensic:~/NSRL/RDS_236m$ ./nsrlext-v01m.pl

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina

 uso: nsrlext.pl -n nsrl_files_comma_separated -p nsrl_prod_files_comma_separated [-g known_good_txt] [-b known_bad_txt] [-h]

 -n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
 -p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
 -g :known good txt filename. Ex: -g good.txt
 -b :known bad txt filename. Ex: -b bad.txt
 -s :string to search for. Default "Hacker Tool". Ex: -s "Web Builder|Presentation"
 -h :help

Enter the NSRL hashset file list (comma delimited)
Enter the NSRL product file list (comma delimited)


To partition the hash database and actually split the ‘NSRLFile.txt’ file in two parts we need to specify a couple of switches to the ‘nsrlext.pl’ program. First we use the switch ‘-n’ to give the program the path to our hash database, ‘NSRLFile.txt’. Next we use the switch ‘-p’ to specify the path to ‘NSRLProd.txt’, the file that contains all the product codes for the entries in the hash database. We also need to specify which ‘ApplicationType’ strings to search for so we can build the list of products we are going to separate from the hash database. This is done with the switch ‘-s’, with the names to search for separated by ‘|’. In the example below I am searching for the application types that caught my attention above.

pmedina@forensic:~/NSRL/RDS_236m$ ./nsrlext-v01m.pl -n NSRLFile.txt -p NSRLProd.txt -s "Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography"

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina

searching for products labeled: Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography
excluding the following products:
19296,"Spector CNE Investigator","1998-2008","254","903","English","employee monitoring"
20722,"uTorrent","0.9.2","395","1429","English","File Sharing"
9755,"ProDiscover Basic and ZeroView","2002-2006","WIN","TPathways","English","Encryption"
21763,"iMesh","10","189","1451","English","p2p client"
6490,"Hack Attacks Denied Second Edition","NA","WIN","Wiley","English","Hacker Tool"
3297,"Guide to Hacking Software Security 2002","1.0","WINXP","Silv","English","Hacker Tool"
6510,"Invisible KeyLogger 2000","NA","UNK","Amecisco","English","Keyboard Logger"
21805,"Limewire","4.14.8","189","1448","English","p2p client"
21777,"Limewire","5.5.14","189","1448","English","p2p client"
2099,"RPing","NA","WIN","Unknown","English","Hacker Tool"
4525,"Spector CNE","4.1","WINXP","Spectorsoft","Unknown","Keyboard Logger"
2102,"PortScan","NA","WIN","Unknown","English","Hacker Tool"

Please use '-g' and '-b' to specify the files to store the result of the hash database partition

If everything looks correct and the products you want to separate are listed in the command output, the only thing left to do is to specify the paths to the files to store the result in. This is done by using the switches ‘-g’ and ‘-b’. The entries that match the products we searched for will be put in the file specified with the ‘-b’ switch and all other entries will be put in the file specified with the ‘-g’ switch. Even though ‘-g’ stands for good and ‘-b’ stands for bad, I will not put the entries that match my search criteria in the ‘KnownBad’ hash category. Instead I will put the database with the matches in the ‘KnownForbidden’ category explained in this blog post.

pmedina@forensic:~/NSRL/RDS_236m$ sudo ./nsrlext-v01m.pl -n NSRLFile.txt -p NSRLProd.txt -s "Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography" -g /opt/forensic/database/knowngood/RDS_236m-good.txt -b /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt

nsrlext.pl v0.1m
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
Modified by Par Osterberg Medina

searching for products labeled: Cryptography|Disk Wiper|employee monitoring|Encryption|File Sharing|Hacker Tool|Keyboard Logger|p2p client|password recovery|privacy tool|Steganography
excluding the following products:
19296,"Spector CNE Investigator","1998-2008","254","903","English","employee monitoring"
20722,"uTorrent","0.9.2","395","1429","English","File Sharing"
9755,"ProDiscover Basic and ZeroView","2002-2006","WIN","TPathways","English","Encryption"
processing hash file: NSRLFile.txt
...................... OUTPUT REMOVED
Done !
pmedina@forensic:~/NSRL/RDS_236m$ wc -l NSRLFile.txt
25892925 NSRLFile.txt
pmedina@forensic:~/NSRL/RDS_236m$ wc -l /opt/forensic/database/knowngood/RDS_236m-good.txt /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt
  25804471 /opt/forensic/database/knowngood/RDS_236m-good.txt
     88455 /opt/forensic/database/knownforbidden/RDS_236m-forbidden.txt
  25892926 total
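
The idea behind the split can be sketched with a single ‘awk’ call, shown below on a tiny sample. This is not how ‘nsrlext.pl’ is written; the file names, header lines and hash rows are made up, and the sketch assumes that the product code is the third field from the end of each hash entry. Like the patched program, it copies the header to both output files, which would also explain why the two line counts above add up to one more than the original file.

```shell
#!/bin/sh
# Sketch of partitioning a hash database by ApplicationType with awk.
# Not nsrlext.pl; sample data and the field layout are assumptions.
set -eu
workdir=$(mktemp -d)
prod="$workdir/NSRLProd-sample.txt"
hashes="$workdir/NSRLFile-sample.txt"
good="$workdir/good.txt"
forbidden="$workdir/forbidden.txt"
types='Disk Wiper|hacker tool'

cat > "$prod" <<'EOF'
"ProductCode","ProductName","ProductVersion","OpSystemCode","MfgCode","Language","ApplicationType"
9,"Norton Utilities","8","DOS","SYM","English","Utility"
2102,"PortScan","NA","WIN","Unknown","English","Hacker Tool"
EOF

cat > "$hashes" <<'EOF'
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","1111111111111111111111111111111F","00000001","nu.exe",10,9,"DOS",""
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB","2222222222222222222222222222222F","00000002","portscan.exe",20,2102,"WIN",""
"CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC","3333333333333333333333333333333F","00000003","readme, final.txt",30,9,"DOS",""
EOF

awk -v types="$types" -v good="$good" -v bad="$forbidden" '
    # Product file: remember the codes whose ApplicationType matches.
    FNR == NR {
        type = $NF; sub(/"$/, "", type)
        if (FNR > 1 && tolower(type) ~ ("^(" tolower(types) ")$")) flagged[$1] = 1
        next
    }
    # Hash file: copy the header to both outputs, then route each entry.
    FNR == 1 { print > good; print > bad; next }
    { if ($(NF - 2) in flagged) print > bad; else print > good }
' FS=',"' "$prod" FS=, "$hashes"

grep -c . "$good" "$forbidden"
```

Counting the product code from the end of the line means that a comma inside a file name, as in the third sample row, does not shift the field that is compared.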

The NSRL hash database has now been separated into two hash databases. The first database is placed in our ‘KnownGood’ category and any files that match entries in that database will be automatically discarded. The second database that we created is a database that holds entries of files that are not allowed in our organization - generating alerts if any of these files are detected on a system we are investigating.

Thursday, July 5, 2012

Categories for hash databases

By now you have most likely started to generate some hash databases of your own using hashdog and it is time to put them to use. In this blog post I will describe how I usually categorize my hash databases and how I use them. You might want to do things differently based on the type of forensic investigations you are involved in or the type of environment you are supporting.

Common hash categories
You are probably already using the Reference Data Set (RDS) from the National Software Reference Library (NSRL) as one of your hash databases. These databases have been created by NIST in a controlled environment and contain hashes from applications and operating systems, mostly generated from files still on their original media. These files are known to be good and are usually put in a hash category called ‘KnownGood’. This category contains hash databases of files that are known to be benign, files that you are not interested in investigating further.

You might even have hash databases that contain hashes of malicious files that you want to search for. Those databases are part of a hash category known as ‘KnownBad’. As with the ‘KnownGood’ category, you do not really want to spend your time analyzing any files you find matches for in this category. If you get a match for a hash in any of the databases you have in this category, chances are high that the malware has already been analyzed by multiple organizations before you came across it. A better use of your time is trying to figure out how the malware got onto the system you are investigating and whether other artifacts like registry keys and tmp files are consistent with previous analysis that has been done. After all, the file could just have been placed on your system to throw you off and keep you from finding the real anomaly.

Extending the KnownGood and KnownBad categories
When I started to write hashdog and was generating hash databases of my own, it did not seem right to put some of my databases in the ‘KnownGood’ category. When I was creating databases from files downloaded directly from the vendor or extracted from a verified ISO image, I put the hash databases I generated in the ‘KnownGood’ category. However, when I was generating hashes from standard OS build images and files listed in application shares, I had no really good way of guaranteeing that the files were absolutely free from malware. After thinking a lot and discussing it with Glenn, I came up with a solution that works for the kind of forensic investigations I am mostly involved in – looking for anomalies in a system that could indicate that the system has been compromised.

Instead of using just the two hash categories mentioned above I decided to use a third and a fourth category, calling them the ‘KnownUsed’ and ‘KnownForbidden’ categories. The ‘KnownUsed’ category contains databases of hashes generated from files that are actively being used by the organization I am investigating. Any hits I get from hashes in databases in this category are treated differently than matches I get from one of my ‘KnownGood’ databases. For instance, if a file has a hash that is part of any of the ‘KnownGood’ databases, that file will be discarded immediately without any further analysis being made. If I get a hit for a hash in one of the hash databases in the ‘KnownUsed’ category, the file will not be completely discarded but I will not pay much attention to it, at least not in my initial analysis. The argument for this is that if I am looking for an anomaly, it is highly unlikely that this anomaly is a file that I have previously generated a hash for and included in my ‘KnownUsed’ database.

The fourth category, which I call ‘KnownForbidden’, contains databases of hashes created from files that are part of applications that are not allowed within the organization. Common hashes to put in this category are generated from files belonging to non-corporate encryption software such as Truecrypt, penetration testing software like Metasploit and privacy and cleaning tools like CCleaner. These are applications that are not malicious in themselves but could indicate malicious use if they are found on a system I am investigating. As with the ‘KnownBad’ category, I want to get alerted if any of these files are detected but I do not want to analyze them. To sum things up, these are the categories I use and the way I treat any matches I get for files in the hash databases.

  • KnownGood - Discard any files from further analysis.
  • KnownBad - Alert on any matches but do not analyze the file.
  • KnownUsed - Put these files aside for later analysis.
  • KnownForbidden - Alert on any matches but do not analyze the file.

By breaking up the hash categories this way it is easier for me to focus on the files that I have not seen before, the unknown file whose functionality is not known.
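
The four categories can be turned into a simple triage step, sketched below with one plain hash list per category. The file names, the function name ‘classify’, the made-up hashes and the order in which the categories are checked are all assumptions for illustration; in practice the lookups would go through indexed databases as described in the post on searching.

```shell
#!/bin/sh
# Sketch: map a hash to an action based on the category it is found in.
# File names, hashes and the checking order are assumptions.
set -eu
workdir=$(mktemp -d)
printf '%s\n' 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa1' > "$workdir/knowngood.txt"
printf '%s\n' 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb2' > "$workdir/knownbad.txt"
printf '%s\n' 'ae182dc797cd9ad2c025066692fc041b' > "$workdir/knownused.txt"
printf '%s\n' 'ccccccccccccccccccccccccccccccc3' > "$workdir/knownforbidden.txt"

# classify HASH: print the action for the first category that lists the hash.
classify() {
    for category in knownbad knownforbidden knowngood knownused; do
        # -i: ignore case, -x: match the whole line, -F: literal string
        if grep -q -i -x -F -- "$1" "$workdir/$category.txt"; then
            case $category in
                knowngood) echo "discard" ;;
                knownused) echo "set aside for later analysis" ;;
                *)         echo "alert, do not analyze" ;;
            esac
            return 0
        fi
    done
    echo "unknown: analyze"
}

classify AE182DC797CD9AD2C025066692FC041B
classify ffffffffffffffffffffffffffffffff
```

Anything that falls through all four lists is exactly the unknown file the categories are meant to leave you with.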