## Data Validation Using The MD5 Hash

Most law enforcement computer forensic specialists rely upon mathematical validation to verify that the restored mirror image of a computer disk drive and relevant files exactly match the contents of the original computer. Such comparisons help resolve questions that might be raised during litigation about the accuracy of the restored mirror image. They also act as a means of protection for the computer forensic specialist concerning allegations that files were altered or planted by law enforcement officials during the processing of the computer evidence. In the past 32 bit algorithms were used for this purpose and programs such as CRCHECK and CRC32 became popular. More recently it has become necessary to use more accurate mathematical calculations for this purpose, i.e. 128 bit hashes. The reasons are tied to the potential for brute force attacks using todays powerful desktop computers and also the volume of files that exist on contemporary computer hard disk drives. It is not uncommon to find over 100,000 files stored on computer hard disk drives today and the storage capacity seems to increase every few months due to advances in technology. NTI has adopted the use of the RSA MD5 algorithm for this purpose and it has been incorporated in NTI's FILELIST, CRCMD5 and DISKSIG programs.

Questions have recently been raised regarding the accuracy of 128 bit hashes and the need for such programs in computer processing. One such question was recently raised in a law enforcement Internet list server group by Fred Kerr. Fred is a retired U. S. Air Force, OSI, Special Agent who has a background in forensics and computer evidence processing. Brian Deering provided the following response which provides good food for thought regarding the issue. Brian's comments follow:

This is a toughie but one that intrigues me, for obvious reasons.

Unlike calculating the simple probability of a random event, say the flip of a coin where the input (the coin) and the possible outcomes (actually three, if you consider that it could land on the edge) are a pretty well defined the calculation of the MD5 hash value is NOT a random process. Therefore, since the possible inputs are infinite and the possible outputs are finite and the actual output is formulaically derived and not a random event simple statistical probabilities will probably fail.

Consider the following, however. The MD5 creates a numeric representation of the contents of a message and displays it as a 16 character hexadecimal value. Theoretically, there are 2^128 possible MD5 hash values. Therefore, if you have a file system with 2^128+1 different files on it I can guarantee you that there are at least two different files that will generate duplicate hash values.

Consider the reality of 2^128 possibilities. There are actually 3.402 x 10^38 or 340 billion billion billion billion or a little more than 1/3 of a googol possibilities. When you consider that most people have never seen a million of anything the actual number becomes really difficult to conceptualize. (Leave it to my nine year old to teach me that a googol, which really exists, is actually 10^100.)

Consider the statement in a 1992 MIT document http://theory.lcs.mit.edu/~rivest/Rivest-MD5.txt that says, "It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given pre-specified target message digest." Note the wording carefully, "computationally infeasible", not impossible. Some would argue that ca. 1992, when a 16 MHZ 80486 was a high speed computer, this statement was probably accurate but that today, in the era of 400 MHZ Pentiums, Super Crays and heaven knows what else folks have stuffed into there basement closets, it's no longer INFEASIBLE.

So, in the strictest sense, if you had sufficient computing capability (and I don't know who that would be but I can pretty well bet that no one willing to associate with the likes of you and me has this) then you could conceivably create two different documents, presumably one that inculpates a defendant, with identical hash values. Is this a deficiency of the MD5? Is this a potential problem for courtroom purposes? No, on both questions!!! With any evidence, isn't it possible to tamper with it in such a way as to create a desired output? The obvious response to any questions along this line is, "Yes, it is possible to intervene and effect a possible outcome." and "No, I did not tamper with the evidence."

So, now the more germane question. Could an accidental change occur to data in such a manner that two different files generated identical hash values. The answer is a clear, "Yes." Again, there are a finite number of possible hash values and an infinite number of possible files.

There are actually two interacting probabilities working here. First, what's the probability of any specific document being generated by a random occurrence and second, what's the probability that the resulting document would generate a hash value identical to an already existing file.

First. Examine the probability of a specific accidental change occurring. Say an electrostatic discharge scrambles your data. You finished with a document X characters long, there are 256 different possible outcomes for each character, the possible outcomes for a scrambled document of X characters then is 256^X. The probability then of randomly creating a document that says (without quotes), "I did it." is 256^9 or 1:4,722 X 10^24.

Second. Consider the possibility of a duplicate hash value occurring randomly. I've stated earlier that this is hard to calculate because the MD5 is formulaic rather than random. So let's assume, and there is nothing in the literature anywhere I've found to suggest this, that the MD5 is flawed in some way and doesn't generate all of the 2^128 possible outcomes. Say it only generates even numbers and let's say it doesn't do that very well and only generates half of the possible even numbers. (We're now down to one fourth of the total possible values.) The possible outcomes is now a mere 2^126 or 8.507 x 10^37. Conservatively, the probability of different files generating duplicate hash values is minuscule. Let's go with 1:8.507 x 10^37. I believe that one fourth of the total possible outcomes is a pretty conservative estimate but if you don't think so feel free to use whatever fractional amount you choose, it won't materially change the final outcomes.

NOW. Let's synthesize. Initially it's really unlikely (1:256^X, where X represents the total characters in the document) that any random change will create any specific document. Then, consider the probability that the randomly created document would generate an identical hash value (1:8.507 x 10^37) and you get the following: The probability that a random change could occur that created any specific document of X number of characters and that the newly created document would create a specific hash value is (1:256^X) * (1:8.507 x 10^37). Now, in my advanced years I'll admit that my brain is a little addled but I think that calculates out to 1:2,177 x 10^(37+X).

If you accept the logic to this point, your challenge, should you accept it, is to relate this to someone on a jury. How about this as a comparison. In Pennsylvania the probability of me winning the Super 6 Lotto with any one ticket is 1:39,000,000. Using the 9 character document described above, I will win the Pennsylvania Super 6 Lotto about 5.582 x 10^41 (or 558,205 billion billion billion billion) times before these two events occur. The probabilities become even more astronomical as the document of X characters increases in length.

So, here's the summary. It's all moot anyway. Because I'm only going to win the Lottery once and then retire.

Fred, are you a betting man?

What is MD5 and why do I care?
MD5

When you send data over a network, there are three issues most organizations have, security, authenticity, and integrity. The security of your data ensures that no one can read your data. This is important for the military, where secrets have to be kept from enemy hands. Authenticity guarantees the originator of the data, you know for certain who sent the data. This is important for the legal world, such as digital signatures. Integrity guarantees that the data has not been altered in transit, that the data you received is the data that was sent. This is important for many industries, such as the financial world. MD5 is such a tool, it guarantees the integrity of your data.

MD5 can help you in a variety of ways. When you download files from the Internet, you can use MD5 to guarantee you downloaded the correct file. This protects you from Trojans or corrupted files. If you uses tools such as Tripwire to protect the integrity of your filesystem, you are most likely using MD5. You are most likely using MD5 if you are using a public/private key infrastructure.

Developed in 1994, MD5 is a one-way hash algorithm that takes any length of data and produces a 128 bit "fingerprint" or "message digest". This fingerprint is  "non-reversible", it is computationally infeasible to determine the file based on the fingerprint. This means someone cannot figure out your data based on its MD5 fingerprint. Here is an example of a MD5 output for the binary /usr/bin/ls:

homer \$md5 /usr/bin/ls

MD5 (/usr/bin/ls) = 1eabd3dbc0746c8a4b5467f99a4f8823

The actual finger print is

1eabd3dbc0746c8a4b5467f99a4f8823

Basically, what MD5 did was apply a mathematical algorithim to the "ls" binary to produce the fingerprint (to learn the gory mathematical details about the algorithim, check out RFC 1321 at http://www.cis.ohio-state.edu/rfc/rfc1321.txt.) Everytime you do a MD5 hash of the binary /usr/bin/ls, you should get the exact same fingerprint. If you get a different fingerprint, then the binary has been altered, maybe the result of a system patch or the binary has been trojaned.

When you download a new file or patch, one of the first things you can do is a MD5 hash of the file. Compare the fingerprint to a known good fingerpint (usually posted on remote site). If the fingerprints match, you can be assured of the file?s integrity. This is how the tool Tripwire works. It builds a database of fingerprints for all your binaries, then later on compares the binaries to that database. However, tripwire uses a variety of hash algorithms in addition to MD5, such as snefru.

Since MD5 does not encrypt data, it is not restricted by any exportation rules. You can freely use and distribute this tool anywhere in the world. To learn the history of MD5, check out http://www.rsasecurity.com/. You can download MD5 at http://archiv.leo.org/pub/comp/general/security/md5/.