1

I want to represent short texts (ie word, couple of words) as a 64 bit hash (want to store them as longs)

MessageDigest.getInstance("MD5") returns 128bits.

Is there anything else I could use, could i just peel off half of it. I am not worried of someone trying to duplicate a hash, I would like to minimize the number of clashes (two different strings having the same hash)

Smalcat
  • 231
  • 3
  • 18
  • I think the answer is use String's hashcode see this question: http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings – M. Jessup Jun 01 '10 at 13:18
  • Fantastic! I hope it proves efficient. – Smalcat Jun 02 '10 at 08:00

5 Answers5

2

MD5 (and SHA) hash "smear" the data in a uniform way across the hashed value so any 64 bits ypu choose out of the final value will be as sensitive to a change as any other 64 bits. Your only concern will be the increased probability of collisions.

David Soroko
  • 8,521
  • 2
  • 39
  • 51
2

You can just use any part of the MD5 hash.

We tried to fold 128-bit into 64-bit with various algorithms but the folding action didn't make any noticeable difference in hash distribution.

Why don't you just using hashCode() of String? We hashed 8 million Email addresses into 32-bit integer and there are actually more collisions with MD5 than String hashCode. You can run hashCode twice (forward and backward) and make it a 64-bit long.

ZZ Coder
  • 74,484
  • 29
  • 137
  • 169
1

You can take a sampling of 64-bits from the 128-bit hash. You cannot guarantee there will be no clashes - only a perfect hash will give you that, and there is no perfect hash for arbitrary length strings) but the chances of a clash will be very small.

As well as a sampling, you could derive the hash using a more complex function, such as XOR consecutive pairs of bits.

mdma
  • 56,943
  • 12
  • 94
  • 128
1

As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
0

What about using some block cipher with 64bit block size ?

Rostislav Matl
  • 4,294
  • 4
  • 29
  • 53