Trying to improve the performance of a function that compares strings, I decided to compare them by comparing their hashes. So is there a guarantee that, if the hashes of two very long strings are equal, the strings themselves are also equal?
-
I believe so. Hashes are absolute representations of the data they contain. So equal strings should have equal hashes. – WhoaItsAFactorial May 10 '12 at 13:24
-
Why not compare the strings in the first place? Calculating the hashes forces you to inspect every character of both strings. So does comparing them (but the comparison may return "unequal" at the first mismatch). – wildplasser May 10 '12 at 13:27
-
@Jeremy1026: That's simply not true. Suppose you use a 4-bit hash. 4 bits can hold 2^4 = 16 different values, so you could never distinguish between more than 16 strings with that hash. In practice, hashes are typically hundreds of bits, but there's always a limit to the number of items they can distinguish. Granted, collisions are extremely unlikely with a sufficiently long hash, but there's never a guarantee that different strings will have different hashes. – Adam Liss May 10 '12 at 22:14
3 Answers
While it's guaranteed that two identical strings will give you equal hashes, the converse is not true: for a given hash value, there are always many possible strings that produce it. This follows from the pigeonhole principle.
That being said, the chances of two different strings producing the same hash can be made infinitesimally small, to the point of being negligible in practice.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. That means two random different strings have roughly one chance in 2^128 of producing the same hash, which is, for practical purposes, as close to impossible as it gets.
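To make the distinction concrete, here is a short sketch (the question doesn't specify a language, so Python's `hashlib` is used for illustration): identical strings always hash to the same value, while equal hashes only make equality overwhelmingly likely, not certain.

```python
import hashlib

def md5_hex(s: str) -> str:
    """Return the MD5 digest of a string as a hex string."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

a = "some very long string " * 1000
b = "some very long string " * 1000
c = a + "!"

# Identical strings are guaranteed to produce identical hashes...
assert md5_hex(a) == md5_hex(b)

# ...but the converse holds only with overwhelming probability:
# by the pigeonhole principle, some distinct strings must collide.
print(md5_hex(a) == md5_hex(c))  # almost certainly False
```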

-
Interestingly, MD5 has been broken: an attacker can _intentionally_ create a string that hashes to a given value. There simply aren't enough bits, which is why SHA has become the current standard in cryptography. – Adam Liss May 10 '12 at 22:16
-
Yes, that's the very large difference between getting a "random collision" and getting an "intentional collision". On the random front, MD5 is still good enough. Now, if the system must take into consideration the risk of intentional collision (which is not always necessary), then yes, MD5 is no longer good enough. – Cyan May 11 '12 at 16:19
-
How can generating and comparing MD5 hashes be faster than comparing the original strings?! – Aprillion May 16 '12 at 06:37
-
It may not be faster. In fact, it depends on the use case. Typically, if the comparison is done just once, it's faster to compare the original strings directly. But if the same string must be compared several times, typically when looking for duplicates, or if the result must be stored for later re-use, then comparing hashes gets the upper hand. – Cyan May 16 '12 at 09:25
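The duplicate-detection case from the comment above can be sketched in Python: hash each string once, then compare fixed-size digests instead of performing every pairwise string comparison (`find_duplicates` is a hypothetical helper, not from the original post).

```python
import hashlib

def find_duplicates(strings):
    """Report probable duplicates by hashing each string once.
    Each string is scanned a single time; subsequent checks
    compare only 16-byte MD5 digests."""
    seen = {}        # digest -> first string with that digest
    duplicates = []
    for s in strings:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        if digest in seen:
            duplicates.append((seen[digest], s))
        else:
            seen[digest] = s
    return duplicates
```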
In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings must be traversed in order to calculate the two hash values, whereas the simple compare is fast, and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by @AdamLiss and @Cyan.
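The early-exit behavior described above can be made explicit with a small sketch (a hypothetical Python helper; the built-in `==` already behaves this way internally):

```python
def equal_strings(a: str, b: str) -> bool:
    """Direct comparison: gives an exact answer and can stop at the
    first mismatching character, often long before the end."""
    if len(a) != len(b):
        return False          # cheapest possible rejection
    for x, y in zip(a, b):
        if x != y:
            return False      # early exit on the first difference
    return True               # had to traverse everything: strings are equal
```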
There are, however, several interesting cases where hash comparison can be employed to great advantage. As mentioned by @Cyan, if the comparison is to be done more than once, or the result must be stored for later use, then the hash may be faster. A case not mentioned by others is when the strings live on different machines connected via a local network or the Internet: passing a small amount of data between the two machines will generally be much faster than passing the strings themselves. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash of each string on its own machine (assuming you are able to run that computation on the remote machine) and again, if they differ, you are done. If the hash values are the same, and you must have absolute certainty, there is no easy shortcut to that certainty, though using lossless compression on both ends will reduce the amount of data transferred for the full comparison. And finally, if the two strings are separated by time, as alluded to by @Cyan: if you want to know whether a file has changed since yesterday, and you have stored the hash of yesterday's version, then you can simply compare today's hash to it.
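The "separated by time" case can be sketched in Python with `hashlib` (the filename and the stored digest are hypothetical placeholders for illustration):

```python
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical workflow: store today's digest, compare tomorrow.
# yesterday_digest = load_stored_digest()          # saved earlier
# changed = file_digest("data.txt") != yesterday_digest
```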
I hope this will help stimulate some "out of the box" ideas for someone.

I am not sure your performance will improve. Both approaches, building a hash and then comparing integers, and simply comparing the strings with equals, have the same complexity: O(n), where n is the number of characters.
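A rough Python sketch of why both approaches are linear: hashing must read every byte of both strings, while a direct compare reads at most every byte (and may stop sooner).

```python
import hashlib

def compare_direct(a: str, b: str) -> bool:
    # O(n): scans until the first mismatch (worst case: all characters).
    return a == b

def compare_via_hash(a: str, b: str) -> bool:
    ha = hashlib.md5(a.encode()).digest()  # O(n): reads every byte of a
    hb = hashlib.md5(b.encode()).digest()  # O(n): reads every byte of b
    return ha == hb                        # O(1): fixed 16-byte comparison
```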
