Re: MD5 and LIFNs (was: Misc Comments)

Larry Masinter (masinter@parc.xerox.com)
Sat, 30 Apr 1994 02:13:12 PDT

To: uri@bunyip.com
Subject: Re: MD5 and LIFNs (was: Misc Comments)
From: Larry Masinter <masinter@parc.xerox.com>
Message-Id: <94Apr30.021323pdt.168962@nebula.parc.xerox.com>
Date: Sat, 30 Apr 1994 02:13:12 PDT

Some rough calculations about the likelihood that two files will have
the same MD5 signature:

MD5 signatures will only work if you can construct an index of them.

Let's suppose that the index we'll work with is at most 100 terabytes,
with 100 bytes for each entry (signature + URL). That's at most 2^40
different items.

If you assign each of 2^40 items a K-bit random number, the likelihood
that two will get the same number is
2^79/2^K.

The MD5 signature is 128 bits. Even if it isn't really random, let's
suppose it loses 32 bits of randomness; that's still 96-bits of
randomness; this works out to a .0005 % chance that even two documents
would be assigned the same number.

Now, maybe MD5 isn't really even that random, but then, there aren't
really 2^40 documents that you'd want to name.

Yes, the `birthday paradox' is that the likelihood that two things
will have the same hash goes up with the square of the number of
things you're hashing; however, the hash space is large enough that
even with an outrageously large number of documents, the likelihood
that just two of them would have the same hash is still small.