
I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.
jasonmp85
ripper234
  • The following link has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: http://www.partow.net/programming/hashfunctions/index.html – Jan 01 '11 at 10:48

9 Answers

74

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not to mention the thousands of developer eyes that have already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger.

Edit:

As I mentioned in a comment on @brianegge's answer, there aren't many use cases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hashes for several fields, don't simply XOR them - multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad as it returns 0 if both values are equal; {'foo','foo'} and {'bar','bar'} would therefore have the same hash code.
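For several fields, that multiply-and-add scheme can be wrapped in a small helper. A minimal sketch - the `combine` method and its seed constants are made up for illustration, not part of any standard API:

```java
public class FieldHash {
    // 64-bit variant of String.hashCode(), as shown above
    public static long hash(String s) {
        long h = 1125899906842597L; // prime seed
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Fold each field's hash into a running value, multiplying by a
    // small prime so that swapped fields produce different results.
    public static long combine(String... fields) {
        long h = 17;
        for (String f : fields) {
            h = 31 * h + hash(f);
        }
        return h;
    }

    public static void main(String[] args) {
        // order matters: ('foo','bar') and ('bar','foo') differ
        System.out.println(combine("foo", "bar") != combine("bar", "foo"));
    }
}
```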

sfussenegger
  • 35,575
  • 15
  • 95
  • 119
  • Note that for Strings <= 5 chars, the upper 32bits will be 0. So this will only make a difference for longer strings. – Aaron Digulla Nov 02 '09 at 13:09
  • Good catch. A higher prime and non-null start value should help. – sfussenegger Nov 02 '09 at 13:43
  • I've chosen a quite high 64-bit prime as a start value. As a result, hashes for Strings <= 5 chars shouldn't be 0 in the first 32 bits. However, on second thought I doubt that having 32 zeros at the beginning hurts the properties of the hash function. I kept 31 as a second prime as this is what's used in String.hashCode() too (which is still very close to what I suggest here) – sfussenegger Nov 02 '09 at 14:14
  • The String.hashCode is only good for fast calculation (`h * 31` is optimised to `(h << 5) - h`), but it is poor at avoiding textual hash collisions (the strings "FB" and "Ea" collide). Since this 64-bit hash uses the same scheme, it has essentially the same weakness. – Luke Usherwood Jul 30 '15 at 04:41
  • Regarding use-cases for 64-bit and 128-bit keys: maybe you want a hash as a unique signature for strings, to avoid storing the string content entirely. Here you'd better (a) use a cryptographic-strength hash, and (b) don't forget the "birthday problem" (ask wikipedia) when selecting the hash size. Even a "good" 64-bit hash has to be assumed to have a lower # bits of "security" (say 32) - so together with the BP it starts to look dodgy as a unique signature even for sets of over 100,000 strings. There's a reason git went with 160-bit SHA-1 :-) – Luke Usherwood Jul 30 '15 at 05:08
  • I am trying to generate a hashcode from a password string I get from an input, and this method gives a long number. It is not the same as the md5 hashcode I received for the string when using an online string-to-md5hashcode generator. Can someone show me how I can get the md5 hashcode of a string in Java? For example, if I use an online hashcode generator for the string "hello", I get a mixture of letters and numbers: 5d41402abc4b2a76b9719d911017c592. The method above gives me something like 6487232239950166829. – naz786 Oct 25 '17 at 18:06
  • @LukeUsherwood You are misinformed. Multiplication is faster in interpreted languages because they get optimized down. If you use shifts like that in an interpreted language, it is slower because it isn't going to be as optimized. This Java hash function is very bad, over 40 years old, and should be obsoleted by newer, better functions. – bryc Sep 08 '21 at 18:55
  • sorry -1 but bad formula. I tried (at random) and easily got collisions: 0AS and 0B4 both give -3351804022671247779 – guillaume girod-vitouchkina Jul 30 '22 at 14:30
  • @guillaumegirod-vitouchkina fair enough, I haven't been too high on this answer since I wrote it in 2009. I've actually thought about deleting it a couple of times already, but since no other answer has gained significant support I thought I'd leave it as is. To be fair though, `String.hashCode()` has a collision for the same values too. Same for all strings with an equal prefix followed by either "AS" or "B4", actually. – sfussenegger Aug 17 '22 at 08:11
  • @guillaumegirod-vitouchkina The reason is that the difference between '4' and 'S' is 31, the multiplier used in `String.hashCode()`. It would probably make sense to use a multiplier that's a prime > `Character.MAX_VALUE`, but that's deviating from the original implementation quite a bit, especially in performance. See https://stackoverflow.com/a/299748/178526 – sfussenegger Aug 17 '22 at 08:14
  • @guillaumegirod-vitouchkina so the collision will be there in `String.hashCode()` for any 2 strings with any common prefix and ending in `"str" + c1 + c2` and `"str" + (c1 + 1) + (c2 - 31)`. – sfussenegger Aug 17 '22 at 08:21
6

An answer for today (2018). SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--

Scott Carey
  • The accepted answer (Java's string hash) is insultingly simple, to the point where it is a poor hash. SipHash is in many ways over-engineered, over-complicated and not very fast in interpreted languages due to the sheer number of required operations and rounds. MurmurHash exists in the middle, where it is fast, but not extremely weak. – bryc Sep 08 '21 at 18:57
4

Create an SHA-1 hash and then take the lowest 64 bits.
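A minimal sketch of this using the standard `MessageDigest` API; which eight digest bytes you keep, and in which byte order, is an arbitrary choice here:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Sha1To64 {
    // Hash a string with SHA-1 and keep only the first 8 bytes as a long.
    public static long hash64(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (digest[i] & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash64("hello") != hash64("world"));
    }
}
```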

Aaron Digulla
  • You could even do an XOR of the first and last 64 bits. But isn't an SHA-1 hash a little over the top? If a cryptographically secure hash isn't necessary, you definitely lose some points on requirement 5 ;) – sfussenegger Nov 02 '09 at 11:05
  • @sfussenegger: Don't try to add random randomness. XOR doesn't always help. Even clipping the hash can have unpredictable results. Give it a try with a few million test cases or understand the mathematics behind it. Otherwise, you just make things worse with such a "blind improvement". – Aaron Digulla Nov 02 '09 at 13:06
  • It's not about adding random randomness. The idea was simply to keep all bits of the SHA-1 hash, which is **designed to be uniformly distributed**. Therefore, there shouldn't be any unexpected side effect - but on second thought it's useless overhead in the end. Clipping doesn't have unpredictable results either - because that's exactly what e.g. HashMap.indexFor(int,int) does to map a hash to an index of the hash table. So it really doesn't matter if you clip a 128-bit hash to 64 bits, as it will be further clipped to fit the hash table anyway. – sfussenegger Nov 02 '09 at 14:08
  • It really depends on the properties of the bits. Folding the hash this way can create unexpected clumps which *shouldn't* happen with clipping. But clipping can also fail if the various strings produce very similar lower 32 bits. That's why it's usually better not to try to "improve" an existing algorithm without knowing its exact properties. I'm talking here because I once created a hash which didn't hash very well :) – Aaron Digulla Nov 02 '09 at 14:19
  • I think we all did it at least once ;) I'd be very interested to see if either approach causes problems with SHA-1. Maybe somebody finds some time to run a simple test? – sfussenegger Nov 02 '09 at 15:30
2
long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update: I think the above satisfies "the simplest thing which could possibly work"; however, I agree with @sfussenegger's idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distribute your keys. For example, Java's HashMap is based on power-of-two-length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
brianegge
  • -1 Hardware resources are not the issue here - I didn't specify how the hash value is to be used, but I promise you it's not "storing n values in a hashmap". I will get collisions if I process enough items, long before I run into hardware problems. – ripper234 Nov 02 '09 at 12:03
  • I could imagine a *huge* hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're almost certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion) is quite impossible though. – sfussenegger Nov 02 '09 at 12:44
  • The main point here: A couple of people at Sun and all over the world have improved this algorithm over the past ten years. It's unlikely that you can come up with something better without investing at least a week or so doing extensive research on the distribution properties of your strings. – Aaron Digulla Nov 02 '09 at 13:07
  • A hash table with a 32-bit hash will start running into problems after about 65,536 entries. A 32-bit hash won't suffice for a hash table with millions of entries. Even concatenated 32-bit hashes don't work; tried it once and experienced a massive slowdown – Seun Osewa Nov 27 '09 at 15:40
2

Why not use a CRC64 polynomial? These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java".
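For illustration, here is a minimal bitwise CRC-64 sketch, assuming the reflected ECMA-182 polynomial; production implementations use precomputed lookup tables instead of the inner bit loop:

```java
public class Crc64 {
    // Reflected form of the ECMA-182 polynomial
    private static final long POLY = 0xC96C5795D7870F42L;

    public static long crc64(byte[] data) {
        long crc = 0xFFFFFFFFFFFFFFFFL; // initial value
        for (byte b : data) {
            crc ^= (b & 0xFFL);
            for (int i = 0; i < 8; i++) {
                // shift right, conditionally XOR in the polynomial
                crc = (crc >>> 1) ^ ((crc & 1) != 0 ? POLY : 0);
            }
        }
        return ~crc; // final XOR
    }

    public static void main(String[] args) {
        System.out.println(crc64("hello".getBytes()) != crc64("world".getBytes()));
    }
}
```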

Peter Tillemans
1

Reverse the string to get another 32-bit hashcode and then combine the two:

String s = "astring";
long upper = ((long) s.hashCode()) << 32;
long lower = ((long) new StringBuilder(s).reverse().toString().hashCode()) - ((long) Integer.MIN_VALUE);
long hash64 = upper + lower;

(String has no reverse() method itself, so the reversed hash code is computed via StringBuilder here.)

1

Do something like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123L;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping one around a ByteArrayOutputStream will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.

Finally, BigInteger will let you turn the output bytes into an easier-to-use number. MD5 produces a 128-bit hash and SHA-1 a 160-bit hash, so if you need 64 bits you can just truncate.
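One simple way to truncate is `BigInteger.longValue()`, which keeps the low-order 64 bits of the number - a sketch:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class TruncatedDigest {
    // MD5 gives 128 bits; longValue() keeps the low-order 64 of them.
    public static long md5Low64(byte[] input) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(input);
        return new BigInteger(digest).longValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Low64("foo".getBytes()) != md5Low64("bar".getBytes()));
    }
}
```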

SHA-1 should hash almost anything well, and with infrequent collisions (it's 160-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.

I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.
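The getInstance-once-then-clone() pattern might look like the sketch below; note that clone() support is provider-dependent, so treating the MD5 digest as cloneable is an assumption about the default provider:

```java
import java.security.MessageDigest;

public class DigestReuse {
    private static final MessageDigest PROTOTYPE;

    static {
        try {
            // look up the algorithm once
            PROTOTYPE = MessageDigest.getInstance("MD5");
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // per hash: clone the prototype instead of calling getInstance() again
    public static byte[] md5(byte[] input) throws CloneNotSupportedException {
        MessageDigest md = (MessageDigest) PROTOTYPE.clone();
        return md.digest(input);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5("abc".getBytes()).length); // MD5 digests are 16 bytes
    }
}
```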

jasonmp85
0

Have you looked at Apache Commons Lang?

But for 64 bits (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch make it easy to create a 64-bit hash (just use long instead of int). For 128 bits you need additional hacks...
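For illustration, the Effective Java recipe with `long` in place of `int` might look like this sketch (the fields are made up for the example):

```java
public class LongHashBuilder {
    // Start from a nonzero constant and fold each field in with a small prime.
    public static long hash(int count, long product, double stdDev,
                            String name, char delimiter) {
        long h = 17;
        h = 31 * h + count;
        h = 31 * h + product;
        h = 31 * h + Double.doubleToLongBits(stdDev);
        h = 31 * h + (name == null ? 0 : name.hashCode());
        h = 31 * h + delimiter;
        return h;
    }

    public static void main(String[] args) {
        // changing a single field changes the hash
        System.out.println(hash(1, 2L, 3.0, "a", 'x') != hash(1, 2L, 3.0, "b", 'x'));
    }
}
```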

St.Shadow
  • Commons-lang won't help at all for hashes larger than the standard 32-bit ones. It does those very well, but beyond that not so much. – jasonmp85 Jun 03 '10 at 10:59
-2

DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.

I'm not aware of a function but here's an idea that might help:

  • Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' bit[1], for 'A' bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".

You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.

Assuming your input is text-only, I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions - it is based on the Anagram Detector described in Programming Pearls (see here).
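A sketch of this letter-signature idea, with bits 0-25 for 'a'-'z', bits 26-51 for 'A'-'Z', and the string length (modulo 4096) in the top 12 bits:

```java
public class LetterSignature {
    public static long hash(String word) {
        long bits = 0;
        for (char c : word.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                bits |= 1L << (c - 'a');       // bits 0-25
            } else if (c >= 'A' && c <= 'Z') {
                bits |= 1L << (26 + c - 'A');  // bits 26-51
            }
        }
        return bits | ((long) (word.length() % 4096) << 52); // top 12 bits
    }

    public static void main(String[] args) {
        // anagrams of equal length collide, as the comments point out
        System.out.println(hash("rate") == hash("tear"));
        System.out.println(hash("cat") == hash("dog"));
    }
}
```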

Adamski
  • -1 The longer the strings get, the more collisions you'll get. Additionally, the hash is weak as (assuming natural language) most strings will contain vowels and frequent consonants (it's even possible to quite reliably guess the language of a string from its frequent consonants and vowels, by the way). – sfussenegger Nov 02 '09 at 11:09
  • @sfussenegger: The OP mentions the need to hash textual strings well, which implies some upper limit on string length (e.g. the longest word in the English language is only 45 characters). Additionally, the OP makes no mention that this needs to be a secure hash. – Adamski Nov 02 '09 at 11:20
  • Why does this requirement imply an upper limit on string length? This hash function is extremely weak - regardless of being secure or not. – sfussenegger Nov 02 '09 at 11:26
  • I read "textual strings" to imply natural language. If this isn't the case then I agree my approach is invalid. Why do you think it's weak? – Adamski Nov 02 '09 at 11:28
  • I suspect that this algo gives *lots of clashes* with short words: `rate == tear`, `rose == sore` etc etc. – oxbow_lakes Nov 02 '09 at 11:29
  • @oxbow_lakes: You're correct. In this case perhaps the remaining 12 bits should be used to store the result of String.hashCode() (modulo 4096). – Adamski Nov 02 '09 at 11:32
  • a) it doesn't work for special characters, numbers and the like b) see @oxbow_lakes' comment, plus cases with duplicated characters (lose == loose) c) it won't work for long texts as they will have close to all character flags set ... I might even find more reasons, but that should be enough for now – sfussenegger Nov 02 '09 at 11:36
  • @Adamski: if String.hashCode() is the strongest part of your hash function, why not replace the whole function with String.hashCode()? If you change it from 32 to 64 bits - voila - you get exactly the approach I've suggested :) ... and I can't believe somebody voted for this suggestion at all. – sfussenegger Nov 02 '09 at 11:38
  • @sfussenegger: I agree with *all* your points. My solution assumes that the OP wants to hash individual natural language words. – Adamski Nov 02 '09 at 11:39
  • @sfussenegger: The approach you suggested is entirely valid. However, I was merely trying to suggest something that could help the OP if their *problem domain was sufficiently narrow*. If not then sure, they can call String.hashCode(), but given ripper's rep I imagine he already knows how to do this. – Adamski Nov 02 '09 at 11:42