13

If you use Visual Studio's own refactoring menu to add a GetHashCode implementation to a class like this:

Generate GetHashCode menu

and select the only int property in the class:

Member selection screen

it generates this code on .NET Framework:

public override int GetHashCode()
{
    return -1937169414 + Value.GetHashCode();
}

(it generates HashCode.Combine(Value) on .NET Core instead, which I'm not sure if it involves the same value)

What's special about this value? Why doesn't Visual Studio use Value.GetHashCode() directly? As I understand, it doesn't really affect hash distribution. Since it's just addition, consecutive values would still accumulate together.

EDIT: I only tried this with different classes with Value properties but apparently property name affects the number generated. For instance, if you rename the property to Halue, the number becomes 387336856. Thanks to Gökhan Kurt who pointed this out.

Sedat Kapanoglu
  • 46,641
  • 25
  • 114
  • 148
  • See https://learn.microsoft.com/en-us/dotnet/api/system.string.gethashcode?view=netcore-3.1 in the remarks-section. "Hash codes for identical strings can differ across .NET implementations, across .NET versions, and across .NET platforms (such as 32-bit and 64-bit) for a single version of .NET. In some cases, they can even differ by application domain" – Link Apr 30 '20 at 07:23
  • @Link how is that relevant? that's not even a string, the property is an `int`. – Sedat Kapanoglu Apr 30 '20 at 07:24
  • Sorry wrong link: https://learn.microsoft.com/en-us/dotnet/api/system.object.gethashcode?view=netcore-3.1 This behavior applies also to Object.GetHashcode @SedatKapanoglu – Link Apr 30 '20 at 07:27
  • https://source.dot.net/#System.Private.CoreLib/HashCode.cs,ea15142a54827dbb has a comment that might apply. I haven’t used .NET in a while – are there really built-in types with “bad” `GetHashCode` implementations? If there are, that would explain it. Kind of seems like it should be on the implementer, though… – Ry- Apr 30 '20 at 07:29
  • My guess would be that it avoids collisions between different classes and structs that have similar implementions (like relying on a single `int`). Although that's a mere guess. – NotFound Apr 30 '20 at 07:32
  • It's *magic number* , some mathematic with prime numbers and impoving hash (reducing collisions), see [this answer](https://stackoverflow.com/a/263416/1997232). Why is it `-1937169414`? Someone decided. Is it good? It should be good enough. Do you need to change it? Nope, but you can try if you know what you are doing (seems you are not). From other comments it should be already clear that *"if it involves the same value"* is a bad thought, you shouldn't rely that the value will be the same in different systems. – Sinatr Apr 30 '20 at 08:14
  • @Sinatr What exactly is a magic number? In the answer I see no notion of a magic number; only prime numbers and it's easy to see that `-1937169414` isn't a prime number (its negative and divisible by 2). – NotFound Apr 30 '20 at 08:40
  • @404, magic number is a number which doesn't reveal its meaning (magic - something you can't explain), e.g. writing `5` instead of `0b101` for some bit operations. The number `-1937169414` has to be prime (otherwise it will not be good), but looking at binary value it also has all bits of low word set, shift + prime number add? – Sinatr Apr 30 '20 at 09:06
  • 2
    `-1937169414` is integer multiplication of `-1521134295` and `-783812246`. The more significant number here is `-1521134295` which appears in every hash code calculation. `-783812246` is seed number. A seed number is chosen based on number of members in equation. In anonymous classes, seed number is calculated based on field names. So there are as many seed numbers as there are integers. We can assume a seed number is random. As for the significance of `-1521134295`, I think it reduces collision and only an inside developer would be able to answer precisely how. – Gokhan Kurt Apr 30 '20 at 16:25
  • @Sinatr but it isn't a prime number? – Sedat Kapanoglu Apr 30 '20 at 18:23
  • @GökhanKurt thanks! as you said, it's generated differently based on the field names. it looks like it's been done to put hash code values at different offsets for different classes so when combined, they are more distributed. – Sedat Kapanoglu Apr 30 '20 at 18:29

2 Answers2

4

As GökhanKurt explained in the comments, the number changes based upon the property names involved. If you rename the property to Halue, the number becomes 387336856 instead. I had tried it with different classes but didn't think of renaming the property.

Gökhan's comment made me understand its purpose. It's offsetting hash values based on a deterministic, but randomly distributed offset. This way, combining hash values for different classes, even with a simple addition, is still slightly resistant to hash collisions.

For instance, if you have two classes with a similar GetHashCode implementations:

public class A
{
    public int Value { get; set;}
    public int GetHashCode() => Value;
}

public class B
{
    public int Value { get; set;}
    public override int GetHashCode() => Value;
}

and if you have another class that contains references to these two:

public class C
{
    public A ValueA { get; set; }
    public B ValueB { get; set; }
    public override int GetHashCode()
    {
        return ValueA.GetHashCode() + ValueB.GetHashCode();
    }
}

a poor combination like this would be prone to hash collisions because the resulting hash code would accumulate around the same area for different values of ValueA and ValueB if their values are close to each other. It really doesn't matter if you use multiplication or bitwise operations to combine them, they would still be prone to collisions without an evenly distanced offset. As many integer values used in programming are accumulated around 0, it makes sense to use such an offset

Apparently, it's a good practice to have a random offset with good bit patterns.

I'm still not sure why they don't use completely random offsets, probably not to break any code that relies on determinism of GetHashCode(), but it would be great to receive a comment from Visual Studio team about this.

Sedat Kapanoglu
  • 46,641
  • 25
  • 114
  • 148
4

If you look for -1521134295 in Microsoft's repositories you'll see that it appears quite a number of times

Most of the search results are in the GetHashCode functions, but they all have the following form

int hashCode = SOME_CONSTANT;
hashCode = hashCode * -1521134295 + field1.GetHashCode();
hashCode = hashCode * -1521134295 + field2.GetHashCode();
// ...
return hashCode;

The first hashCode * -1521134295 = SOME_CONSTANT * -1521134295 will be pre-multiplied during the generation time by the generator or during compilation time by CSC. That's the reason for -1937169414 in your code

Digging deeper into the results reveals the code generation part which can be found in the function CreateGetHashCodeMethodStatements

const int hashFactor = -1521134295;

var initHash = 0;
var baseHashCode = GetBaseGetHashCodeMethod(containingType);
if (baseHashCode != null)
{
    initHash = initHash * hashFactor + Hash.GetFNVHashCode(baseHashCode.Name);
}

foreach (var symbol in members)
{
    initHash = initHash * hashFactor + Hash.GetFNVHashCode(symbol.Name);
}

As you can see the hash depends on the symbol names. In that function the constant is also called permuteValue, probably because after the multiplication the bits are permuted around somehow

// -1521134295
var permuteValue = CreateLiteralExpression(factory, hashFactor);

There are some patterns if we view the value in binary: 101001 010101010101010 101001 01001 or 10100 1010101010101010 10100 10100 1. But if we multiply an arbitrary value with that then there are lots of overlapping carries so I couldn't see how it works. The output may also has different number of set bits so it's not really a permutation

You can find the another generator in Roslyn's AnonymousTypeGetHashCodeMethodSymbol which calls the constant HASH_FACTOR

//  Method body:
//
//  HASH_FACTOR = 0xa5555529;
//  INIT_HASH = (...((0 * HASH_FACTOR) + GetFNVHashCode(backingFld_1.Name)) * HASH_FACTOR
//                                     + GetFNVHashCode(backingFld_2.Name)) * HASH_FACTOR
//                                     + ...
//                                     + GetFNVHashCode(backingFld_N.Name)

The real reason for choosing that value is yet still unclear

phuclv
  • 37,963
  • 15
  • 156
  • 475
  • This is great research, thank you. I didn't know hash code generation was in Roslyn, I thought it would be Visual Studio itself. – Sedat Kapanoglu May 01 '20 at 20:57