
To keep it simple, my question is: how do I hash a String (about 200 characters) as quickly as possible? Security is not important, but collisions ARE a big deal.

Note: After a quick investigation, it seems like MurmurHash3 might be the best choice. I am open to any comment saying otherwise, though.

First, I know that there are plenty of other similar questions, but I haven't found a convincing answer yet.

I have a list of objects, each containing a list of about 3k paragraphs, which is saved to a database. Every X hours, those paragraphs are regenerated, and I need to find whether any paragraph has changed so that I can push only the changed ones.

The quickest way I found to find the differences (knowing that most of the time the content will be identical) is to create a Merkle tree, save it to the DB, and iterate over the Merkle tree to find the differences, instead of comparing the paragraphs themselves.
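The Merkle-tree idea can be sketched like this (a minimal illustration, not the asker's actual code; the leaf hash comes from whatever string hash ends up being chosen, e.g. MurmurHash3 or FNV-1a):

```java
import java.util.List;

// Sketch: each inner node's hash is derived from its children's hashes,
// so two identical subtrees can be compared with a single long comparison.
final class MerkleNode {
    final long hash;
    final List<MerkleNode> children;

    // Leaf: the hash of a single paragraph, computed by the chosen string hash.
    MerkleNode(long leafHash) {
        this.hash = leafHash;
        this.children = List.of();
    }

    // Inner node: mix the children's hashes in order.
    MerkleNode(List<MerkleNode> children) {
        long h = 1469598103934665603L;        // arbitrary seed
        for (MerkleNode c : children) {
            h = h * 1099511628211L ^ c.hash;  // simple order-sensitive mix
        }
        this.hash = h;
        this.children = children;
    }
}
```

Two trees built from identical paragraph hashes then get identical root hashes, and a single changed leaf propagates a different hash all the way to the root.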

This implies that, in my case, I will be creating tens of thousands of hashes per second to compare with what is in the DB. Therefore, I need a very efficient way to create those hashes. I don't care about security; I only need to ensure that the number of collisions remains very, very low.

What would be the best algorithm available in Java for that?


In my case, the main object is composed of Sections, each composed of Languages, each composed of Paragraphs. The comparison strategy is:

1) If the object hash is identical, stop; otherwise go to 2)

2) Loop over all Sections, keeping only the Sections with a different hash

3) Loop over all Languages of those Sections, keeping only the Languages with a different hash

4) Loop over all Paragraphs of those Languages; if the hash is different, push the new content.
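The four steps above amount to a pruned tree walk, which could be sketched like this (the `Node` type is hypothetical, and the old and new trees are assumed to have the same shape):

```java
import java.util.ArrayList;
import java.util.List;

// Descend only into subtrees whose hashes differ, collecting the changed
// leaves (paragraphs) that need to be pushed.
record Node(long hash, List<Node> children) {
    static Node leaf(long h) { return new Node(h, List.of()); }
}

class TreeDiff {
    static void collectChanged(Node oldNode, Node newNode, List<Node> changed) {
        if (oldNode.hash() == newNode.hash()) return;  // identical subtree: stop
        if (newNode.children().isEmpty()) {            // a changed paragraph
            changed.add(newNode);
            return;
        }
        for (int i = 0; i < newNode.children().size(); i++) {
            collectChanged(oldNode.children().get(i),
                           newNode.children().get(i), changed);
        }
    }
}
```

Since most content is unchanged, most subtrees are pruned at the first comparison, which is the whole point of storing the tree in the DB.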

HLP
  • See: [Which hashing algorithm is best for uniqueness and speed?](http://programmers.stackexchange.com/q/49550/88986) – durron597 Aug 04 '15 at 18:49
  • I find the question rather unclear: do you need to decide whether a *specific* object's paragraph has changed, or is the idea to *find* which object a paragraph belongs to (i.e. what's the primary key)? – Durandal Aug 04 '15 at 18:53
  • Also have a look at http://stackoverflow.com/questions/2624192/good-hash-function-for-strings – slartidan Aug 04 '15 at 18:59
  • @slartidan Unfortunately that link just recommends some basic hashing algorithms; it isn't really sufficient for someone with as critical a performance issue as the OP seems to have. – durron597 Aug 04 '15 at 19:02

1 Answer


This amazing answer on Programmers Stack Exchange tells you all you need to know.

The short version is: use FNV-1a (the Fowler–Noll–Vo hash function). It has excellent performance, high randomness, and low collisions.
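For reference, a minimal 64-bit FNV-1a in Java looks like this (a sketch using the published FNV offset basis and prime; verify the constants against the FNV reference page before relying on them):

```java
// 64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime (mod 2^64).
public final class Fnv1a {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    public static long hash64(byte[] data) {
        long hash = OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xffL);  // fold in one byte
            hash *= PRIME;        // multiply by the FNV prime
        }
        return hash;
    }
}
```

For the asker's use case, this would be applied to each paragraph's UTF-8 bytes, e.g. `Fnv1a.hash64(paragraph.getBytes(StandardCharsets.UTF_8))`.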

Any further explanation I might shed on this question would just be a copy-and-paste from that Programmers.SE answer, which incidentally is the second-highest-voted answer on the entire site.

Some other thoughts:

  • Ultimately, you have a pretty niche use case. Most people aren't dealing with 1-billion-entry datasets regularly, so you may have to do your own benchmarking.
  • That said, high randomness suggests the algorithm is likely to scale well for hashing English text.
  • You haven't really talked about other issues; are you able to keep the entire data set in memory? What are your footprint requirements?

See also: Fastest Hash Algorithm for Text Data

durron597
  • Sounds cool, but I am a bit disappointed to see collisions on a dataset of only 250k. To be clear, collisions are a big deal for me, and I have over 1 billion entries. With an algorithm that has over 2^128 possible values, wouldn't you expect no collisions on such a small dataset? – HLP Aug 04 '15 at 19:03
  • If you think about the reason for the collisions, it's rather normal: they happen on one-word data, which is very compact. The bigger the data, the fewer the collisions. You say you have entire paragraphs; test the algorithms on the first 250k paragraphs you have and check the collisions in your actual context rather than in that answer's specific context. – Olivier Grégoire Aug 04 '15 at 19:14
  • I am willing to buy that. That said, do you have an explanation for why a shorter String is more likely to collide, or is that just a theory? – HLP Aug 04 '15 at 19:19
  • For properly done hashes, the value to hash should be larger than the hash size. You say you want to hash text. Text is usually drawn from 26 lowercase characters + 26 uppercase characters + punctuation and space (+/- 10 characters). That's roughly 6 bits of entropy per character. If your hash has a space of 64 bits, for your hashes to be relevant you want at least 11 characters (`ceil(64/6)`). The linked test was run on a dictionary, and I'm sure 90% of words have fewer than 11 characters. So his test was good at testing speed, but at testing real hash distribution? Nope. Far too little entropy. – Olivier Grégoire Aug 04 '15 at 19:50
  • I won't write any answer: it wouldn't answer the question itself, but thanks for the appreciation ;) – Olivier Grégoire Aug 04 '15 at 20:02
  • @OlivierGrégoire It's much less than 6 bits of entropy, by the way: http://what-if.xkcd.com/34/ – durron597 Aug 04 '15 at 20:07
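The `ceil(64/6)` figure from the entropy comment above can be checked with a quick back-of-envelope computation (the 6 bits/character value is the commenter's estimate for English text, not a measured quantity):

```java
// Back-of-envelope check: with roughly 6 bits of entropy per character,
// how many characters are needed before inputs can cover a 64-bit hash space?
public class EntropyCheck {
    public static void main(String[] args) {
        int bitsPerChar = 6;  // commenter's rough estimate for English text
        int hashBits = 64;    // e.g. a 64-bit hash such as FNV-1a 64
        int minChars = (int) Math.ceil((double) hashBits / bitsPerChar);
        System.out.println(minChars);  // prints 11
    }
}
```

Since the asker's strings are ~200 characters, they carry far more entropy than the 64-bit hash space, so the short-word collisions seen in the linked dictionary test shouldn't apply.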