4

I am trying to find something potentially faster than SHA256. I have over 1 billion records I need to hash and verify whether they are unique. I am currently running each record through MD5, which seems pretty fast, and then through SHA256 to avoid collisions. Running them in that order seems to give me a small performance boost, but I still need it faster. I am looking for the names of, or examples of, hashes implemented in C#, or some pseudo-code, so I can recreate them in C#.

EntryLevel
  • 215
  • 1
  • 4
  • 11
  • I'd say MD5 is suitable for your needs if the records are all pretty normal. You only get collisions from specifically crafted weird inputs, not from very similar, but slightly different, records. – Davio Jun 28 '13 at 12:31
  • 2
    `I am currently running it through an MD5 which seems pretty fast then through the sha256 to avoid collisions` Just to be sure, you're hashing to MD5, then if two records have the same hash you check the sha256 to avoid collisions? If so, you can try replacing MD5 with CRC, which should be way faster (but obviously will generate more collisions) – Kevin Gosse Jun 28 '13 at 12:33
  • I was just running them through both to try and further reduce the collisions – EntryLevel Jun 28 '13 at 13:28
  • 1
    Surely if there is a collision in the MD5 values, that will result in a collision in the SHA-256 output? Or am I misunderstanding how you are using SHA-256? – Duncan Jones Jun 28 '13 at 13:48
  • First off, you're never going to accidentally create an MD5 collision. Second, as others point out, if your first hash collides, your second one will too. Pick one and use just that. – Nick Johnson Jun 28 '13 at 15:39
  • 1
    `SHA-2(MD5(x))` is a bad idea. Doesn't offer an advantage over `MD5(x)` in your case. – CodesInChaos Jul 31 '13 at 15:40
  • How large are your records? How common are duplicates? Are they stored in RAM, or on a disk? Can a malicious entity create records? – CodesInChaos Jul 31 '13 at 15:41
  • I seriously cannot answer any of those questions. I think they are on disk, which is what kind of sucks – EntryLevel Jul 31 '13 at 15:46

6 Answers

6

There is a lot of dubious information in the answers here. You tagged your question with cryptography and only mention cryptographic hash functions, but it sounds like you don't really need cryptographic security, in particular because you say:

I have over 1 billion records I need to hash and verify if they are unique.

There are four properties to a cryptographic hash function:

  • it is easy to compute the hash value for any given message
  • it is infeasible to generate a message that has a given hash
  • it is infeasible to modify a message without changing the hash
  • it is infeasible to find two different messages with the same hash.

You're really only interested in the first property, and uniqueness is a smaller-scale requirement that is only partially related to the other three properties of cryptographic security.

Why do you care?

There is overhead in cryptographic security. You don't need it, and you're interested in speed, so why not skip it? The hash widths of MD5 and the SHA family are admittedly large enough for your purposes.

Check out the list of hash functions on Wikipedia, or check out the article on ordinary (non-cryptographic) hash functions. More to the point, what's wrong with the built-in .NET hashing functions? Have you tried just deferring to the Object.GetHashCode() method? That MSDN reference has a lot to say about using hash functions. You don't say much about the data you're hashing, so it's hard to say whether the output would be unique between your objects or not. How are you feeding the object into the MD5 hasher? I presume you're taking its binary representation. A similar approach works with the built-in non-crypto hash function.
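
For illustration, here is a minimal sketch of both routes, assuming each record can be reduced to a string (the reduction is whatever you are already doing to feed the MD5 hasher):

```csharp
// Minimal sketch: the same record bytes fed to MD5 and to the built-in,
// non-crypto hash. Assumes a record can be reduced to a string.
using System.Security.Cryptography;
using System.Text;

static class RecordHashing
{
    // Crypto route: 128-bit MD5 over the record's binary representation.
    public static byte[] Md5Of(string record)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(record));
        }
    }

    // Non-crypto route: the 32-bit hash code the framework already provides.
    public static int BuiltInHashOf(string record)
    {
        return record.GetHashCode();
    }
}
```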

You may be concerned about the uniqueness of the built-in hash functions. They only return a regular int, which gives 2^32 possible values, only about four times the size of the data set you're working with. However, you always need a backup plan for hash functions: a collision between any given pair is unlikely, but with a billion records and a 32-bit hash, collisions will occur. The standard fallback is to perform a more expensive comparison, usually a reference comparison followed by a field-wise value comparison.
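
As a sketch of that fallback, with a hypothetical Record type and fields, the usual pattern is to override GetHashCode for the cheap check and Equals for the reference and field-wise comparison:

```csharp
using System.Linq;

// Hypothetical record type: equal hash codes only mean "possibly equal";
// Equals settles it with a reference check and a field-wise comparison.
class Record
{
    public string Name;
    public int Size;
    public byte[] Payload;   // assumed non-null in this sketch

    public override int GetHashCode()
    {
        // Cheap combined hash; collisions are expected and handled by Equals.
        return (Name?.GetHashCode() ?? 0) ^ Size;
    }

    public override bool Equals(object obj)
    {
        var other = obj as Record;
        if (other == null) return false;
        if (ReferenceEquals(this, other)) return true;   // reference comparison
        return Name == other.Name
            && Size == other.Size
            && Payload.SequenceEqual(other.Payload);     // field-wise value comparison
    }
}
```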

If you aren't prepared to do an exact comparison when hash outputs do collide, you're basically counting down until you get a false positive. That might not be a big deal for you: only you can judge what the downside is.

Additionally, performing another hash function computation is probably not much faster than the direct comparison. You're better off on all counts to go with the sure thing and perform the lengthy, direct comparison.

Another common anti-collision technique is to use multiple keys. If your data points have several large subcomponents, you hash and compare them independently. If they have some large and some small components (say, simple numeric types), you hash the large ones and do a direct comparison on the small ones. If they have data whose ordinal is easy to take (say, the lengths of strings or the sizes of containers), you can perform the direct comparison on those bits.

If that doesn't work out for you, take a look at implementations of the other hash functions listed on the wiki. Here's a pretty good reference for MurmurHash3, which can compute 32-bit or 128-bit hash values. There are other hash functions in the list with wide outputs that also have C# libraries available. But as that reference points out, MurmurHash is way faster than the MD5 and SHA functions, although it doesn't do a direct comparison to the Object.GetHashCode method I mentioned above.
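
If you want to see roughly what such a function looks like, here is a minimal C# sketch of the 32-bit MurmurHash3 routine, adapted from Austin Appleby's public-domain reference implementation. Treat it as illustrative rather than a vetted library; it assumes a little-endian platform.

```csharp
using System;

static class Murmur3
{
    // Sketch of MurmurHash3 x86 32-bit. Constants and mixing steps follow the
    // public-domain reference; not hardened or independently verified.
    public static uint Hash32(byte[] data, uint seed = 0)
    {
        const uint c1 = 0xcc9e2d51;
        const uint c2 = 0x1b873593;

        uint h1 = seed;
        int len = data.Length;
        int blocks = len / 4;

        unchecked
        {
            // Body: process 4-byte little-endian blocks.
            for (int i = 0; i < blocks; i++)
            {
                uint k1 = BitConverter.ToUInt32(data, i * 4);
                k1 *= c1;
                k1 = (k1 << 15) | (k1 >> 17);   // rotate left 15
                k1 *= c2;

                h1 ^= k1;
                h1 = (h1 << 13) | (h1 >> 19);   // rotate left 13
                h1 = h1 * 5 + 0xe6546b64;
            }

            // Tail: 0-3 remaining bytes.
            uint t = 0;
            int tailIndex = blocks * 4;
            switch (len & 3)
            {
                case 3: t ^= (uint)data[tailIndex + 2] << 16; goto case 2;
                case 2: t ^= (uint)data[tailIndex + 1] << 8; goto case 1;
                case 1:
                    t ^= data[tailIndex];
                    t *= c1;
                    t = (t << 15) | (t >> 17);
                    t *= c2;
                    h1 ^= t;
                    break;
            }

            // Finalization mix.
            h1 ^= (uint)len;
            h1 ^= h1 >> 16;
            h1 *= 0x85ebca6b;
            h1 ^= h1 >> 13;
            h1 *= 0xc2b2ae35;
            h1 ^= h1 >> 16;
        }

        return h1;
    }
}
```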

Patrick M
  • 10,547
  • 9
  • 68
  • 101
  • 1
    With a 256 bit crypto hash, I wouldn't worry about backup plans. Chances of an accidental collision are far smaller than the chance of a random hardware error (e.g. a bit in your RAM flipping). - "Verification: A tax on people who are bad at math" – CodesInChaos Jul 31 '13 at 15:36
  • 1
    @CodesInChaos there's some truth to what you say. But when you're just using a hash function for fast uniqueness check, it can be substantially faster (to run; obviously slower to code and maintain) to use a shorter hash width, without crypto security, and backed up by the direct comparison. Since that's what the question focused on, that's how I framed my answer. It's all a matter of trade-offs: how slow is the hash, how slow is the direct comparison, what is the expected collision rate, what are the consequence of a collision etc. etc. – Patrick M Jul 31 '13 at 17:02
3

How about doing something different?

Use a simple hashing function on each record, like one you would use when inserting the record into a hash table, perhaps mapping each record to a 32-bit int. Then, if there is a hash collision, compare the colliding records directly to check uniqueness.
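
A minimal sketch of that idea, assuming the records are available as byte arrays and that a direct byte-for-byte comparison is an acceptable tie-breaker on collision:

```csharp
using System.Collections.Generic;
using System.Linq;

static class Dedup
{
    public static List<byte[]> UniqueRecords(IEnumerable<byte[]> records)
    {
        var buckets = new Dictionary<int, List<byte[]>>();
        var unique = new List<byte[]>();

        foreach (var rec in records)
        {
            // Cheap 32-bit hash; any decent non-crypto hash works here.
            int key = rec.Aggregate(17, (h, b) => unchecked(h * 31 + b));

            List<byte[]> bucket;
            if (!buckets.TryGetValue(key, out bucket))
            {
                bucket = new List<byte[]>();
                buckets[key] = bucket;
            }

            // Same key means collision or duplicate: fall back to a full comparison.
            if (!bucket.Any(existing => existing.SequenceEqual(rec)))
            {
                bucket.Add(rec);
                unique.Add(rec);
            }
        }
        return unique;
    }
}
```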

ase
  • 13,231
  • 4
  • 34
  • 46
  • +1 This basically means you're counting on the fact that if a very simple (and bad) hash differs, then a very good one will differ for sure. There are no false negatives. – Andrei May 07 '20 at 16:13
1

You can use MD5, then if you encounter colliding records you can check them with SHA256 or even SHA-1.

VahiD
  • 1,014
  • 1
  • 14
  • 30
1

Are you checking every record with sha256? You should only need to check the records where you have md5 collisions, which should be rare even with md5. And at that point, when you're just comparing duplicates, it might be faster just to compare raw record to raw record, because the compare will return on the first difference.
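
For instance, a raw comparison that bails out at the first mismatching byte could look like this (assuming records are held as byte arrays):

```csharp
static class RawCompare
{
    // Direct comparison of two raw records; returns as soon as they differ.
    public static bool SameRecord(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;   // stop at the first difference
        }
        return true;
    }
}
```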

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
  • with over 1 billion records what are my chances of a collision or where can I find that? – EntryLevel Jun 28 '13 at 13:39
  • About one collision for every 2^64 records. Having a collision at all is like winning the lottery... but you're buying enough tickets it just might happen. Maybe. Well, probably not, but you still need to be prepared. http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5 – Joel Coehoorn Jun 28 '13 at 13:51
  • 1
    It's a birthday paradox problem. MD5 keys (the days) are 128 bits so there are `2^128` of them. With `1e9` records (the birthdays), the approximate chance of a collision is `1 - exp(-1e18 / 2^129) ~= 1.5e-21`. The chance of collision is low, but a lot higher than might naïvely be expected (the initial version of this comment contained a bug; my apologies). See this [answer](http://stackoverflow.com/a/1705517/45914) for additional details. – jason Jun 29 '13 at 00:54
0

You could even do something like this: take the MD5, and if you get a collision, append the same little bit of extra data to both values and take the MD5 again. It is highly unlikely for the two to collide again if they were actually different. So rather than running SHA after a collision, run MD5 again with something appended, which should be faster.
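
A minimal sketch of that idea; the suffix string is arbitrary, the only requirement being that both colliding values get the same one appended:

```csharp
using System.Security.Cryptography;
using System.Text;

static class Rehash
{
    // Re-hash a colliding value with a fixed suffix appended; two genuinely
    // different records are extremely unlikely to collide under both hashes.
    public static byte[] Md5WithSuffix(byte[] value)
    {
        byte[] suffix = Encoding.UTF8.GetBytes("tie-break");   // arbitrary, but identical for both values
        byte[] combined = new byte[value.Length + suffix.Length];
        value.CopyTo(combined, 0);
        suffix.CopyTo(combined, value.Length);
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(combined);
        }
    }
}
```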

Ruchir Patwa
  • 171
  • 2
  • 10
0

From the way you phrased the question, it does not appear you need a security-grade hash algorithm. If your question conveys all of the primary requirements of what you are trying to accomplish, you may not need a hash algorithm at all.

If you are constructing a method called Unique that returns true if and only if the two rows differ, you can gain speed and retain reliability by checking the following three row characteristics, in this order.

  • length (if they are NOT fixed length records)
  • checksum
  • actual value

The first is probably already known if the record length is variable. The second can be quickly calculated at the time of storage. With a billion records, you will have to cover the possibility of collisions even if you use security-grade hash algorithms (which you stated are too slow anyway). So when the checksum matches, which will be rare if the checksum has a sufficient number of bits, you will have to fall back to comparing the actual values byte by byte.
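
A minimal sketch of that ordering, assuming variable-length rows stored as byte arrays and a cheap checksum computed once when each row is stored (the checksum below is a trivial stand-in):

```csharp
using System.Linq;

static class RowCompare
{
    // Trivial stand-in checksum; compute and store it once per row.
    public static int Checksum(byte[] row)
    {
        return row.Aggregate(17, (acc, b) => unchecked(acc * 31 + b));
    }

    // True if and only if the two rows differ, checked cheapest-first.
    public static bool Unique(byte[] a, int aChecksum, byte[] b, int bChecksum)
    {
        if (a.Length != b.Length) return true;     // 1. length
        if (aChecksum != bChecksum) return true;   // 2. checksum
        return !a.SequenceEqual(b);                // 3. actual value, byte by byte
    }
}
```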

Douglas Daseeco
  • 3,475
  • 21
  • 27