106

Can people recommend quick and simple ways to combine the hash codes of two objects? I am not too worried about collisions, since I have a hash table which will handle those efficiently; I just want something that generates a code as quickly as possible.

Reading around SO and the web there seem to be a few main candidates:

  1. XORing
  2. XORing with Prime Multiplication
  3. Simple numeric operations like multiplication/division (with overflow checking or wrapping around)
  4. Building a string and then using the String class's GetHashCode() method

What would people recommend and why?
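For concreteness, the four candidates might be sketched like this (a rough illustration, not a recommendation; the method names are made up, and `a` and `b` stand for the two hash codes being combined):

```csharp
// 1. Plain XOR.
static int CombineXor(int a, int b) => a ^ b;

// 2. XOR with prime multiplication.
static int CombineXorPrime(int a, int b)
{
    unchecked { return (a * 31) ^ b; }
}

// 3. Simple arithmetic, wrapping around on overflow.
static int CombineArithmetic(int a, int b)
{
    unchecked { return a * 31 + b; }
}

// 4. Build a string and hash that (allocates; typically the slowest).
static int CombineViaString(int a, int b) =>
    (a + "," + b).GetHashCode();
```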

H H
  • 263,252
  • 30
  • 330
  • 514
RobV
  • 28,022
  • 11
  • 77
  • 119

10 Answers

159

I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
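That order-insensitive case might look like the sketch below: XOR is commutative and associative, so any permutation of the same items hashes identically, and the equal-pairs-cancel-to-zero weakness described above is the price you pay.

```csharp
using System.Collections.Generic;

// Order-independent combiner for a set of items. Any permutation of
// the same items gives the same result; equal items still cancel to 0,
// which is exactly the weakness noted above for ordered data.
static int HashUnordered<T>(IEnumerable<T> items)
{
    int hash = 0;
    foreach (var item in items)
        hash ^= item?.GetHashCode() ?? 0;
    return hash;
}
```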

I usually use:

unchecked
{
    int hash = 17;
    hash = hash * 31 + firstField.GetHashCode();
    hash = hash * 31 + secondField.GetHashCode();
    return hash;
}

That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.
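A complete override following this pattern, for a hypothetical type with two string fields and an int (the class, its field names, and the null guards are illustrative additions, not part of the original answer):

```csharp
public sealed class Person
{
    private readonly string firstName;
    private readonly string lastName;
    private readonly int age;

    public Person(string firstName, string lastName, int age) =>
        (this.firstName, this.lastName, this.age) = (firstName, lastName, age);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (firstName?.GetHashCode() ?? 0);
            hash = hash * 31 + (lastName?.GetHashCode() ?? 0);
            hash = hash * 31 + age.GetHashCode();
            return hash;
        }
    }
}
```

Each additional field is just one more `hash = hash * 31 + ...;` line.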

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • Yeah, that was my concern about XORing; in the type of data I'm pairing it's fairly unlikely to be pairing two equal items, but not impossible – RobV Oct 29 '09 at 22:21
  • 5
    Looks like Dan Bernstein's (or Chris Torek's) hash, just with different constants. Nobody knows why that works well either. – ephemient Oct 29 '09 at 22:27
  • @RobV: I prefer not to have to think if I don't have to. I use this hash even when I *could* get away with XOR, just to avoid having to wonder whether it's safe or not :) – Jon Skeet Oct 29 '09 at 22:48
  • what are the firstField and secondField that you talk about? – SuperString Dec 28 '09 at 21:23
  • @SuperString: the first and second fields you want to hash. You then keep going for the other fields. – Jon Skeet Dec 28 '09 at 21:46
  • 13
    A word of warning, this is (a variation of) the Bernstein hash, and because nobody knows why it does well in tests it is not advisable when hashing is critical. See http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx. Also, you should wrap this code in an `unchecked { }` block. GetHashCode() should not throw any exceptions. – H H Dec 31 '09 at 16:38
  • 1
    This is so mysterious - why 17? why 31? is it important that they are primes? – tofutim Sep 21 '12 at 05:01
  • 6
    @tofutim: The 31 is a good choice as multiplication by 31 can be optimized to a shift and subtract. Whether it *is* optimized that way depends on the platform. As for why those numbers work well for hashing - as Henk says, it's a bit of a mystery. – Jon Skeet Sep 21 '12 at 05:44
  • Why do you prime (no pun intended) the pump with 17 and 31? Why isn't it just `return firstField.GetHashCode() * 31 + secondField.GetHashCode();`? Or is that part of what no one knows? – Patrick M Apr 22 '15 at 20:54
  • @PatrickM: I follow Josh Bloch (or some variant), basically. It does mean that it's harder to end up with a hash code of 0... – Jon Skeet Apr 22 '15 at 20:55
  • But is a hash code of 0 any worse than a hash code of 16337? (17 * 31 * 31, which is amusingly close to 3L337...) I suppose having more multiplications means you have bigger hash codes, on the whole. Is there a similar improvement in distribution? – Patrick M Apr 22 '15 at 21:12
  • @PatrickM: I think hash codes of 0 are more likely to occur between different types, raising the chance of a collision. For example, two objects with different numbers of fields, but where all the fields give a hash code of 0, would have different hashes under this scheme. But this is a pretty tenuous argument, I'm willing to admit :) – Jon Skeet Apr 22 '15 at 21:13
  • The Bernstein hash (DJB, DJB2) uses 33, not 31. And 5381, not 17. But yeah, otherwise very similar. I've used DJB2 for both string hashing and hashcode combination and it's worked fine for me. DJB2 and other algorithms can be found at http://www.cse.yorku.ca/~oz/hash.html – yoyo May 17 '16 at 17:02
  • @Jon can you please comment on Special Sauce's answer? I'd like to know what your thoughts are on it. – rory.ap Jan 26 '18 at 14:54
  • 3
    @rory.ap: I think it's an excellent piece of work, and I'd be perfectly happy to use those numbers. While I hate to admit to using constants "because someone else said to" that's basically what the 17/31 pair is about. – Jon Skeet Jan 26 '18 at 15:02
  • "Nobody knows" feels like higher beings know it and we can't comprehend it! – M.kazem Akhgary Dec 31 '18 at 21:26
  • 21
    Starting with .NET Core 2.1 you can use the System.HashCode type's Combine method to do this https://learn.microsoft.com/en-us/dotnet/api/system.hashcode.combine – Cosmin Sontu Jan 11 '19 at 09:45
  • Hi, that seems to be a great way to get a good distribution of values. The only thing that bothers me: the process is order-dependent, which means you get a different hash depending on the order in which you feed the values in, and that is prone to error. You can compare the two inner hashes at the start to solve the problem (always combining the smaller or larger inner hash first), but keep in mind that only works for a pair, not more. – Eric Ouellet Oct 03 '19 at 15:07
  • 1
    @EricOuellet: I would say that being order-dependent is a *good* thing. You don't *want* (1, 0) and (0, 1) to give the same hash code, for example, as that gives more hash collisions. – Jon Skeet Oct 04 '19 at 07:50
  • @JonSkeet, I partly agree. I would say that for some cases it is a good thing, but it really depends on the situation. I think it is important to draw a clear distinction between order-dependent and order-independent combining; many people seem to underestimate it. It is very case-specific. – Eric Ouellet Oct 04 '19 at 14:59
  • @EricOuellet: It would only be a good thing for it to be order-independent if equality was also defined in an order-independent way, which is rare in my experience. – Jon Skeet Oct 04 '19 at 15:03
  • 1
    I'm still unable to find `System.HashCode` - I get a compiler error when I try to use it. – Vivraan Oct 03 '20 at 04:43
  • @mushi: Which platform are you targeting? – Jon Skeet Oct 03 '20 at 07:07
  • Android/Unity. I later realised https://github.com/GlitchEnzo/NuGetForUnity exists, so I now have what I need! Except I don't know if it compiles to Android or not. – Vivraan Oct 04 '20 at 12:08
92

If you are using .NET Core 2.1 or later or .NET Framework 4.6.1 or later, consider using the System.HashCode struct to help with producing composite hash codes. It has two modes of operation: Add and Combine.

An example using Combine, which is usually simpler and works for up to eight items:

public override int GetHashCode()
{
    return HashCode.Combine(object1, object2);
}

An example of using Add:

public override int GetHashCode()
{
    var hash = new HashCode();
    hash.Add(this.object1);
    hash.Add(this.object2);
    return hash.ToHashCode();
}

Pros:

  • Part of .NET itself, as of .NET Core 2.1/.NET Standard 2.1 (though, see con below)
    • For .NET Framework 4.6.1 and later, the Microsoft.Bcl.HashCode NuGet package can be used to backport this type.
  • Looks to have good performance and mixing characteristics, based on the work the author and the reviewers did before merging this into the corefx repo
  • Handles nulls automatically
  • Overloads that take IEqualityComparer instances

Cons:

  • Generated hash codes are only stable within a single process run (see the comment below about persisting hashes), so they must not be stored or compared across processes

chwarr
  • 6,777
  • 1
  • 30
  • 57
  • 2
    You can reference https://www.nuget.org/packages/Microsoft.Bcl.HashCode to make it work on .NET Framework 4.6.1 or .NET Standard 2.0, officially. – ChrisTorng Jul 31 '20 at 02:45
  • 1
    `System.HashCode` uses xxHash32 (https://github.com/Cyan4973/xxHash) – Maxence Sep 22 '21 at 15:54
  • Does Bcl.HashCode generate stable hashes for strings? – Red Riding Hood Jun 08 '22 at 09:20
  • @RedRidingHood, HashCode is not the right type to use for a hash code that gets persisted, which is what I assume you mean by "stable". HashCode values [are only stable within the same instance of a process](https://learn.microsoft.com/en-us/dotnet/api/system.hashcode?view=net-6.0#remarks). This has been the documented behavior of .NET GetHashCode and related values for a very, very long time. Though, .NET Core started using a per-process seed to enforce this. If you have more questions, I suggest you post it as a stand-alone question. – chwarr Jun 08 '22 at 21:50
  • Thanks, I was aware of the .NET Core behavior; I just wasn't sure if the Bcl package was different from what .NET Core is using. – Red Riding Hood Jun 09 '22 at 15:44
  • So I've used this before fine for 2 fields, 3 fields... What about 10? 30? – James Aug 14 '23 at 01:25
  • @James: follow the example using `Add` for lots of fields. You "accumulate" values in each call to `Add` and produce a final hashcode by calling `ToHashCode`. – chwarr Aug 15 '23 at 00:31
  • The question is more about effectiveness, and if this simple method breaks down or could be improved when combining many hashes – James Aug 20 '23 at 02:38
  • @James, the [current implementation](https://github.com/dotnet/runtime/blob/5fea131a5f98441d87d2d826eefb054670b0edde/src/libraries/System.Private.CoreLib/src/System/HashCode.cs#L57) uses xxHash for mixing the input hashes together. xxHash has good mixing characteristics. I doubt you'll notice anything untoward as the number of hashes you combine increases. This is in contrast to something like a naive XOR hash which will "lose" an input value if the same value is added later. I also doubt that any future implementations would regress this. – chwarr Aug 22 '23 at 02:36
65

While the template outlined in Jon Skeet's answer works well in general as a hash function family, the choice of the constants is important and the seed of 17 and factor of 31 as noted in the answer do not work well at all for common use cases. In most use cases, the hashed values are much closer to zero than int.MaxValue, and the number of items being jointly hashed are a few dozen or less.

For hashing an integer tuple {x, y} where -1000 <= x <= 1000 and -1000 <= y <= 1000, it has an abysmal collision rate of almost 98.5%. For example, {1, 0} collides with {0, 31}, {1, 1} with {0, 32}, etc. If we expand the coverage to also include n-tuples where 3 <= n <= 25, it does less terribly, with a collision rate of about 38%. But we can do much better.
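The first of those collisions is easy to check by hand: with seed 17 and factor 31, hash(1, 0) = (17 * 31 + 1) * 31 + 0 = 16368, and hash(0, 31) = (17 * 31) * 31 + 31 = 16368. A tiny two-item version of the scheme to confirm:

```csharp
// Two-field instance of the seed-17 / factor-31 scheme, for checking
// the {1, 0} vs. {0, 31} collision described above.
static int Hash1731(int x, int y)
{
    unchecked { return (17 * 31 + x) * 31 + y; }
}
```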

public static int CustomHash(int seed, int factor, params int[] vals)
{
    int hash = seed;
    unchecked // GetHashCode must not throw on overflow
    {
        foreach (int i in vals)
        {
            hash = (hash * factor) + i;
        }
    }
    return hash;
}

I wrote a Monte Carlo sampling search loop that tested the method above with various values for seed and factor over various random n-tuples of random integers i. Allowed ranges were 2 <= n <= 25 (where n was random but biased toward the lower end of the range) and -1000 <= i <= 1000. At least 12 million unique collision tests were performed for each seed and factor pair.

After about 7 hours running, the best pair found (where the seed and factor were both limited to 4 digits or less) was: seed = 1009, factor = 9176, with a collision rate of 0.1131%. In the 5- and 6-digit areas, even better options exist. But I selected the top 4-digit performer for brevity, and it performs quite well in all common int and char hashing scenarios. It also seems to work fine with integers of much greater magnitudes.

It is worth noting that "being prime" did not seem to be a general prerequisite for good performance as a seed and/or factor although it likely helps. 1009 noted above is in fact prime, but 9176 is not. I explicitly tested variations on this where I changed factor to various primes near 9176 (while leaving seed = 1009) and they all performed worse than the above solution.

Lastly, I also compared against the generic ReSharper-recommended function family of `hash = (hash * factor) ^ i;`, and the original `CustomHash()` noted above seriously outperforms it. The ReSharper XOR style seems to have collision rates in the 20-30% range for common use case assumptions and should not be used, in my opinion.
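A minimal sketch of this kind of collision test (not the author's actual harness; the pair-only case, sample count, and RNG seed here are assumptions):

```csharp
using System;
using System.Collections.Generic;

// Rough collision-rate estimate for a seed/factor pair over random,
// distinct integer pairs in [-1000, 1000] (the original search also
// covered longer tuples).
static double CollisionRate(int seed, int factor, int samples = 100_000)
{
    var rng = new Random(12345);          // fixed seed for repeatability
    var inputs = new HashSet<(int, int)>();
    var hashes = new HashSet<int>();
    int collisions = 0;

    while (inputs.Count < samples)
    {
        var pair = (rng.Next(-1000, 1001), rng.Next(-1000, 1001));
        if (!inputs.Add(pair)) continue;  // only test distinct inputs
        int hash;
        unchecked { hash = (seed * factor + pair.Item1) * factor + pair.Item2; }
        if (!hashes.Add(hash)) collisions++;
    }

    return (double)collisions / samples;
}
```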

Special Sauce
  • 5,338
  • 2
  • 27
  • 31
  • 9
    Wow. I love the work that went into this answer. Impressive, well done! – Tom Leys Feb 15 '17 at 20:57
  • Seems to be the best, but I have two remarks. First, an easy one: why not move "seed" and "factor" to the end and give them default values (1009 and 9176), which should do the job for most people? Second: like Jon Skeet's algorithm, this is order-dependent, so you get a different hash if you feed the values in a different order. I wonder if it would not be safer to sort the array first, to ensure the same final hash regardless of feeding order. – Eric Ouellet Oct 03 '19 at 15:13
  • 1
    @EricOuellet Because the `params int[] vals` must come at the end of all function arguments, I was not able to make the `seed` and `factor` defaulted parameters. If you don't care about the `params` syntax convenience, you could remove it and then reorder the parameters to allow the defaults as you suggested. – Special Sauce Oct 09 '19 at 21:42
  • @EricOuellet The default hashing for an array should be to consider permutations (this is the more general case) and thus the hash would be different for different orderings (just as the hash for string "abc" is different than the hash for "acb"). If you specifically wanted a hash function for combinations only, you should probably accept a `HashSet` argument (assuming no duplicates). Otherwise you could rename the function to `CustomHashCombination()` to prevent confusion, and do the internal pre-sort as suggested. – Special Sauce Oct 09 '19 at 21:44
  • 1
    I like this answer, but I wouldn't use `params` because it has to allocate an array on every call. So it may be faster computations-wise, but it creates GC pressure for later. – Gru Jan 20 '20 at 09:13
  • What were the 5 and 6-digit numbers that got generated? Will 1009 and 9176 work well for hashing in a table of 2^28 (`int.MaxValue` / 16) values? – Vivraan Dec 10 '20 at 07:56
  • Not sure if I can trust these constants considering this test has a hard limit [-1000..1000] range for `i`. Wouldn't it be better, if possible, to run it allowing all int values, but using random ints from a bell-curve distribution centered in `0`? You could also add ints coming from GetHashCode of floats and doubles in a bell-curve distribution centered in `0.0`. It'd be great if you published your testing algorithm in a gist or something so people can understand it better (and maybe tweak it for specific cases). – geekley Dec 21 '22 at 05:21
  • @geekley Larger int values are not really a problem because they tend to mix even faster (over the unchecked int.MaxValue modulo). You could make seed and factor very large random constants, and get great mixing / low collisions immediately with almost any seed & factor values you tried. The whole point of my exercise was to find a seed & factor combination that worked well **with a low number of digits** to make a nice readable, convenient snippet. – Special Sauce Dec 30 '22 at 21:13
26

Use the combination logic built into tuples. The example uses C# 7 value tuples.

(field1, field2).GetHashCode();
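Inside a type's own GetHashCode override, this might read as follows (the struct and its field names are illustrative):

```csharp
public readonly struct Point
{
    public readonly int X;
    public readonly int Y;

    public Point(int x, int y) => (X, Y) = (x, y);

    public override bool Equals(object obj) => obj is Point p && p.X == X && p.Y == Y;

    // Delegate hash combination to ValueTuple's implementation.
    public override int GetHashCode() => (X, Y).GetHashCode();
}
```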
Yepeekai
  • 2,545
  • 29
  • 22
  • Nice idea, though I suspect this might have issues with GC churn since you are implicitly creating a short-lived object – RobV Nov 27 '17 at 10:50
  • 11
    @RobV Tuples are value types, so they are stack allocated and exert no GC pressure. – Mike Pedersen Feb 13 '18 at 10:34
  • 4
    One problem... (0,1,2).GetHashCode() and (0,0,1,2).GetHashCode() both yield the same value: 35. Whereas the method in the most upvoted answer yields unique values 0, 1, 2: 506480 and 0, 0, 1, 2: 15699890 – dynamichael Jun 17 '18 at 22:23
  • 2
    Hash codes aren't guaranteed to be unique. You found a case where it isn't... It doesn't make it a bad choice unless there are a lot of collisions (in that case it would be a good idea to submit a bug). I personally prefer to use something from the framework instead of implementing something different. – Yepeekai Jun 24 '18 at 15:08
  • 3
    It's actually the `ValueTuple` type of struct ([MSDN](https://learn.microsoft.com/en-us/dotnet/csharp/tuples)). Be careful that the `Tuple` type is a class and it has GC pressure. I like this way. Internally it's similar to @Stipo's post but very easy to understand and to review. It would be a good choice in most cases. – cactuaroid Aug 16 '18 at 10:05
  • Best answer, works on both .net 4.5 and on core. Use the System.ValueTuple nuget package on .net 4.5, and on core it's built in. – Andriy Volkov Apr 30 '21 at 14:56
21

I presume that the .NET Framework team did a decent job testing their System.String.GetHashCode() implementation, so I would use it:

// System.String.GetHashCode(): http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4
// System.Web.Util.StringUtil.GetStringHashCode(System.String): http://referencesource.microsoft.com/#System.Web/Util/StringUtil.cs,c97063570b4e791a
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash1 = (5381 << 16) + 5381;
    int hash2 = hash1;

    int i = 0;
    foreach (var hashCode in hashCodes)
    {
        if (i % 2 == 0)
            hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hashCode;
        else
            hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ hashCode;

        ++i;
    }

    return hash1 + (hash2 * 1566083941);
}

Another implementation is from System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32) and System.Array.CombineHashCodes(System.Int32, System.Int32) methods. This one is simpler, but probably doesn't have such a good distribution as the method above:

// System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#System.Web/Util/HashCodeCombiner.cs,21fb74ad8bb43f6b
// System.Array.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#mscorlib/system/array.cs,87d117c8cc772cca
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash = 5381;

    foreach (var hashCode in hashCodes)
        hash = ((hash << 5) + hash) ^ hashCode;

    return hash;
}
Stipo
  • 4,566
  • 1
  • 21
  • 37
3

This is a repackaging of Special Sauce's brilliantly researched solution.
It makes use of Value Tuples (ITuple).
This allows defaults for the parameters seed and factor.

// Requires: using System.Runtime.CompilerServices; (for ITuple)
// Must be declared in a static class, since it is an extension method.
public static int CombineHashes(this ITuple tupled, int seed = 1009, int factor = 9176)
{
    var hash = seed;

    for (var i = 0; i < tupled.Length; i++)
    {
        unchecked
        {
            hash = hash * factor + (tupled[i]?.GetHashCode() ?? 0);
        }
    }

    return hash;
}

Usage:

var hash1 = ("Foo", "Bar", 42).CombineHashes();
var hash2 = ("Jon", "Skeet", "Constants").CombineHashes(seed: 17, factor: 31);
3dGrabber
  • 4,710
  • 1
  • 34
  • 42
0

If you're looking for speed and don't have too many collisions, then XOR is fastest. To prevent a clustering around zero, you could do something like this:

finalHash = hash1 ^ hash2;
return finalHash != 0 ? finalHash : hash1;

Of course, some prototyping ought to give you an idea of performance and clustering.

Ed Power
  • 8,310
  • 3
  • 36
  • 42
-1

If your input hashes are the same size, evenly distributed and not related to each other then an XOR should be OK. Plus it's fast.

The situation I'm suggesting this for is where you want to do

H = hash(A) ^ hash(B); // A and B are different types, so there's no way A == B.

of course, if A and B can be expected to hash to the same value with a reasonable (non-negligible) probability, then you should not use XOR in this way.

geofftnz
  • 9,954
  • 2
  • 42
  • 50
  • how would I tell whether my hash codes are evenly distributed, is there an easy benchmark to do for this? I know the collision rate is pretty low but does that necessarily correspond to an even distribution? – RobV Oct 29 '09 at 22:04
-4

Assuming you have a relevant ToString() override (where your different fields appear), I would just return its hash code:

this.ToString().GetHashCode();

This is not very fast, but it should avoid collisions quite well.

Thomas Hugel
  • 146
  • 1
  • 5
-13

I would recommend using the built-in hash functions in System.Security.Cryptography rather than rolling your own.

richardtallent
  • 34,724
  • 14
  • 83
  • 123
  • 14
    No, they have a very different purpose and break the rule that GetHashCode should be fast. – H H Oct 30 '09 at 10:20