10

So I understand that there is proof that MD5 can not guarantee uniqueness as there are more strings in the universe than there are MD5 hash strings, but is there any inverse proof for a finite number of strings?

Basically, if I have strings of maximum length of X, is there an X for which MD5 is guaranteed to be unique? if yes, then what is that X? and if there are more than one values for X, what is the maximum value of X?

or is there such an X for any other hashing algorithm, SHA-1, etc.?

Community
  • 1
  • 1
xtrahelp.com
  • 926
  • 8
  • 14
  • x = 1024 bits according to the following answer http://stackoverflow.com/questions/1999824/whats-the-shortest-pair-of-strings-that-causes-an-md5-collision – Oli May 15 '12 at 03:52
  • 2
    @Oli- That answer says that the shortest *known* hash collision requires 1024 bits. Since MD5 outputs 128-bit values, it's guaranteed that the shortest hash collision has to be much shorter than 1024 bits. – templatetypedef May 15 '12 at 03:56
  • so it is proven to be **not unique** for 1024 bits, but is it proven to be **unique** for less than 1024 bits? – xtrahelp.com May 15 '12 at 03:59
  • 2
    @xtrahelp.com- There has to be a collision for strings of 128 bits, because then there are too many strings for there not to be a collision. 1024 bits is far larger than necessary, but it's (apparently) the best that we've actually done. – templatetypedef May 15 '12 at 04:17
  • Although this is a well-posed question, the fact that you're asking about uniqueness of hashes at all rings alarm bells... – AakashM May 15 '12 at 08:49

2 Answers2

3

Summarizing the excellent answers here: What's the shortest pair of strings that causes an MD5 collision?

The shortest known attack on MD5 requires 2 input blocks, that is, 128 bytes or 1024 bits.

For any hash algorithm that outputs N bits, assuming it distributes inputs approximately randomly, you can assume a collision is more than 50% likely in about sqrt(2^N) inputs. For example, MD5 hashes to 128 bits, so you can expect a collision among all 64 bit inputs. This assumes a uniformly random hash. Any weaknesses would reduce the number of inputs before a collision can be expected to occur.

Community
  • 1
  • 1
Clueless
  • 3,984
  • 1
  • 20
  • 27
  • 1
    The earlier question asks about the smallest *known* hash collision. The value of 1024 bits is so much larger than the hash function's output size of 128 bits that the answer is meaningless in this question. – templatetypedef May 15 '12 at 03:57
  • 1
    Well, you can expect one in ~64 bits, but we only know how to reliably generate them in 1024 bits. I don't know if anyone has tested all ~2^64 short inputs for collisions. It's a lot of work, many years on one computer, but not impossible. – Clueless May 15 '12 at 04:11
  • @Clueless: I get 11,000 years for an 8-core 4 GHz machine using 600 cycles per hash (using http://cr.yp.to/talks/2008.06.05/slides.pdf as a reference for 600 cycles). – Charles May 15 '12 at 18:43
  • Yes, *computing* 2^64 MD5's is feasible by now. But *storage* for 2^64 outputs is I think much more expensive (and writing to the storage probably makes you IO bound?) There are ways to trade less storage -> more time but then time gets expensive... Anyway, yes, it's only a matter of some years till it's reasonably cheap... – Beni Cherniavsky-Paskin Feb 18 '15 at 12:43
1

The answer to your question is yes. For any hash function, there is a maximum length X for which you will get back unique strings. Finding X, though, could be very hard. The idea is to run this program:

X= 0;
For i = 0 onward
   For all strings of length i
      Compute the hash code of that string.
      If a collision is found, return X.
   X = i

The idea is to just list longer and longer strings until you find a hash collision. Eventually you'll have to, since eventually you'll have generated more strings than there are possible hash outputs.

On expectation, assuming that a hash function is actually pretty random, you'll need to generate O(√U) different strings before you find a collision, where U is the size of the space to which the hash function maps. For 256-bit hashes, this is 2256. This means that in practice the above program would never actually terminate unless the hash function was pretty broken, but in theory it means that your number X exists.

Hope this helps!

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • well yes, so has anybody found that X yet? – xtrahelp.com May 15 '12 at 03:50
  • @templatetypedef that assertion doesn't really hold -- there are plenty of known collisions, and for that matter, attacks that make use of them, on hash algorithms considered strong. – Charles Duffy May 15 '12 at 03:53
  • 1
    @CharlesDuffy- Yes, you're correct (sorry about that!). My intended claim was that finding this value of X would probably imply that someone had found the shortest hash collision, and my understanding is that for most hash functions this has not been done yet (we just know of short hash collisions, not the shortest). – templatetypedef May 15 '12 at 03:54