Does String.GetHashCode consider the full string or only part of it?

Question

I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

"it will have bad performance" - bad compared with what alternative? Obviously storing very, very long strings in a HashSet will be slower than storing short ones, but how often is that done? — Joe, Mar 02 '13 at 12:40
On computer `"".GetHashCode() == 371857150`. Is that the same for everyone? — Colonel Panic, Mar 18 '16 at 10:01
@ColonelPanic That's what it looks like from the code posted below, assuming everyone is using a release version of the runtime. csharppad.com produces the same value as you for `"".GetHashCode()` — Millie Smith, Oct 14 '16 at 19:17

Hans Passant · Accepted Answer · 2015-09-14T15:24:34.617

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

This is possibly more than you bargained for, I'll annotate the code a bit:

The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

"it terminates early on an embedded zero" - That is bizarre. I tested it and sure enough the 64bit version ignores chars after a \0 (32bit doesn't). And since NULL is a valid unicode character, this is technically a bug IMO. — redcalx, Feb 18 '14 at 17:48
Logged on microsoft connect ~ http://connect.microsoft.com/VisualStudio/feedback/details/817902/string-gethashcode-ignores-all-chars-after-a-0-in-a-64bit-environment — redcalx, Feb 18 '14 at 18:03
@locster the bug was closed with "as By Design". Has anyone found an explanation why the 64-bit version doesn't hash characters after \0? — data, Jul 02 '14 at 10:31
If MS claim it was by design then they need to give an explanation, otherwise the reasonable assumption is that it is an error but perhaps one they can no longer fix without risks. In live systems there is a .Net config setting to specify use of an alternative hashing scheme, hence there is a simple workaround. Still seems dumb to not address this given that's it's such a key class in the system, and 64 bit is becoming the norm. — redcalx, Jul 03 '14 at 10:48
"designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption" It's not going to help you a lot there, a bit but not a lot. It would help if you were hashing strings that came direct from user-input leaving you open to hash-dos attacks. — Jon Hanna, Jul 07 '15 at 09:28
It seems to me that they used \0 to terminate because they don't have exact length stored anywhere while outside of WIN32 block. So they went for the null termination. String class' fields may differ for platforms. Just a guess. — Bahadir Acar, Sep 14 '15 at 14:49
How come is not computed only once since string is immutable? — Tamas Ionut, Feb 10 '16 at 15:41
The 64-bit version of the loop is not unrolled, in the usual sense. The two stages operate on `hash1` and `hash2` respectively, which are then combined when the loop exits. The zero check is to test for the end of the null-terminated string (in odd positions), as is done in the `while` condition (for even positions). — Drew Noakes, Apr 10 '18 at 00:12
@HansPassant You could store the hash code basically anywhere in the string object. Before the length field, between the length field and the string data or even after the string data. — Paul Groke, Oct 16 '18 at 15:30
hash code could be stored in the string class itself? or maybe its impossible to change this implementation because of backwards compatibility? — M.kazem Akhgary, Dec 31 '18 at 21:32
@JonHanna: To my knowledge, that's the actual reason it exists; passwords and encryption "benefits" are accidental, the real goal is ensuring that web-facing applications can't be attacked with inputs (that are stored in a dictionary) that will reliably collide on the hash code (changing `O(1)` optimal and normal case operations into `O(n)` worst case performance for every lookup, insertion, etc.). — ShadowRanger, Jan 23 '20 at 20:05
Just out of the topic....the while loop confirms that GetHashCode() is a linear operation. — Yawar Murtaza, Aug 14 '20 at 12:11

Ergwun · Answer 2 · 2013-03-02T12:48:03.313

8

Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.

// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}

edited Mar 02 '13 at 12:48

answered Mar 02 '13 at 12:34

Ergwun

12,579
7
56
83

1

yeah, I saw that, but I have no clue what it does :o – Louis Rhys Mar 02 '13 at 12:35
wait. On second reading, it seems that it is different from the code in my ILSpy. Mine doesn't have the for loop over Length. Maybe it is implemented differently in different platform – Louis Rhys Mar 02 '13 at 12:41
2

Um, it hashes the string. You did say you wanted to know what it does, so there it is. The string hash algorithm is different for different versions of the CLR. – Raymond Chen Mar 02 '13 at 12:41
@LouisRhys - that was the one from .NET 2.0 (since I already had that loaded in ILSpy). I've replaced it with the one from .NET 4.0 - looks very similar. – Ergwun Mar 02 '13 at 12:47

Does String.GetHashCode consider the full string or only part of it?

2 Answers2

Linked

Related