61

To quote from Guidelines and rules for GetHashCode by Eric Lippert:

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

So what is the correct way to create a HashCode of a string that I can store in a database?

(Please tell me I am not the first person to have left this bug in software I have written!)

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
Ian Ringrose
  • 51,220
  • 55
  • 213
  • 317
  • 3
    Well, I never rely on GetHashCode, because I know, how sloppy I implement this method. I believe others aren't doing it any better... ;-) – Daniel Hilgarth Mar 01 '11 at 13:17
  • 3
    You're not the first person which has left this bug in software which you've written. – Bobby Mar 01 '11 at 13:19
  • 3
    Dbase engines are already very good at hashing strings. Just create an index for the column. – Hans Passant Mar 01 '11 at 14:16
  • @Hans, see http://stackoverflow.com/questions/2730865/how-do-i-calculate-a-good-hash-code-for-a-list-of-strings, don't assume that the string is stored in a single table. – Ian Ringrose Mar 01 '11 at 14:34
  • 2
    Why does that matter? You'd index the columns that are used in a join anyway to make the query fast. Sounds to me you are trying to do the dbase engine's job. – Hans Passant Mar 01 '11 at 14:38
  • @HansPassant is totally correct here, trying to do the DB's job seems like a bad idea – Scuba Steve Oct 09 '18 at 21:10
  • The answer is to just write your own hashing function. You can find source for some by following links in the comments to the article you posted. Or you can use a built-in hash function that's originally intended for cryptography (MD5, SHA1, etc.) and just not use all of the bits. – Gabe Mar 01 '11 at 13:18

3 Answers3

86

It depends what properties you want that hash to have. For example, you could just write something like this:

public int HashString(string text)
{
    // TODO: Determine nullity policy.

    unchecked
    {
        int hash = 23;
        foreach (char c in text)
        {
            hash = hash * 31 + c;
        }
        return hash;
    }
}

So long as you document that that is how the hash is computed, that's valid. It's in no way cryptographically secure or anything like that, but you can persist it with no problems. Two strings which are absolutely equal in the ordinal sense (i.e. with no cultural equality etc applied, exactly character-by-character the same) will produce the same hash with this code.

The problems come when you rely on undocumented hashing - i.e. something which obeys GetHashCode() but is in no way guaranteed to remain the same from version to version... like string.GetHashCode().

Writing and documenting your own hash like this is a bit like saying, "This sensitive information is hashed with MD5 (or whatever)". So long as it's a well-defined hash, that's fine.

EDIT: Other answers have suggested using cryptographic hashes such as SHA-1 or MD5. I would say that until we know there's a requirement for cryptographic security rather than just stability, there's no point in going through the rigmarole of converting the string to a byte array and hashing that. Of course if the hash is meant to be used for anything security-related, an industry-standard hash is exactly what you should be reaching for. But that wasn't mentioned anywhere in the question.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 7
    Is there anything magic about 23 and `* 31`? Rather, any reason to choose those over any other values? ... over any other [documented] hashing method? I'm guessing no, though 31 being one less than ASCII printables has kept me unnecessarily suspicious. – ruffin Jul 20 '15 at 22:17
  • 18
    @ruffin: They're values recommended by Josh Bloch. Multiplying by 31 is efficient because it can be done as a shift and a subtract. There are various other questions talking about this - it's a bit of a dark art, to be honest. – Jon Skeet Jul 21 '15 at 05:57
  • 23
    Neat! From [Effective Java (2008), page 48](https://books.google.com/books?id=ka2VUBqHiWkC): *The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: `31 * i == (i << 5) - i`. Modern VMs do this sort of optimization automatically.* Looks like some fun reading; thanks again. – ruffin Jul 21 '15 at 20:24
  • 2
    "Modern VMs do this sort of optimization automatically" does this apply to .NET too? – geekley Jul 20 '18 at 17:59
25

Here is a reimplementation of the current way .NET calculates it's string hash code for 64 bit systems. This does not use pointers like the real GetHashCode() does so it will be slightly slower, but it does make it more resilient to internal changes to string, this will give a more evenly distributed hash code than Jon Skeet's version which may result in better lookup times in dictionaries.

public static class StringExtensionMethods
{
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for(int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i+1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i+1];
            }

            return hash1 + (hash2*1566083941);
        }
    }
}
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
1

There is now the System.IO.Hashing package that provides stable and standardized non-cryptographic hash algorithms. While they are designed for byte sequences, it is fairly straightforward to use them safely and very efficiently through Span:

var input = "Hello world";
var inputBytes = MemoryMarshal.AsBytes(input.AsSpan());
var hash = System.IO.Hashing.XxHash32.HashToUInt32(inputBytes);
Console.WriteLine(hash); // 899079058

Note however that, due to the reinterpretation of characters as bytes, the endianness of the system affects the result, so if you move to a big-endian system, the hash above will be different. If that is an issue, you can check BitConverter.IsLittleEndian and swap the bytes if it's false.

IS4
  • 11,945
  • 2
  • 47
  • 86