9

Is it possible one and the same object, particularly a string or any primitive or very simple type (like a struct), to produce different values of the .GetHashCode() method when invoked on different machines?

For instance, is it possible for the expression "Hello World".GetHashCode() to produce a different value on a different machine. I am primarily asking for C#.NET but I suppose this might apply to Java or even other languages?

Edit:

As pointed from answers and comments below, it is known to me that .GetHashCode() can be overriden, and there is no guarantee for the result it produces between different version of the framework. Therefore it is important to clarify that I have simple types in mind (which cannot be inherited, therefore GetHashCode() be overriden) and I am using the same versions of the framework on all machines.

Ivaylo Slavov
  • 8,839
  • 12
  • 65
  • 108
  • Yes, as per [the documentation](http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx). – Cody Gray - on strike Jan 12 '12 at 16:00
  • 2
    For your enjoyment -- http://blogs.msdn.com/b/ericlippert/archive/2011/02/28/guidelines-and-rules-for-gethashcode.aspx – Austin Salonen Jan 12 '12 at 16:09
  • 1
    The difference exists for strings only between x86 and 64. Search for "Different return values for strings on x86 versus x64" in the link provided by @CodyGray – digEmAll Jan 12 '12 at 16:15
  • @IvayloSlavov: inspecting the code in the GetHashCode() function for strings, values are cast to pointers that have different sizes in x86 and x64 machines, so I think that's the reason why you get different values. Can you confirm that the 2 machines are using 2 different architectures ? – digEmAll Jan 12 '12 at 16:34
  • 1
    @digEmAll, I cannot, Actually my question was rather hypothetical than backed with real code. I am thinking of a way to implement internal load balancing on a windows distributed application - I have a service deployed on different local servers, and I need a good and stable hash code (across machines and startups), so I can determine the concrete service to run by a string token. And I was considering the built-in hash to be probably the fastest and easiest way, but it seems I've been wrong. – Ivaylo Slavov Jan 12 '12 at 16:38
  • 7
    @IvayloSlavov: If that is your requirement then GetHashCode is absolutely, positively the wrong thing to use. Use GetHashCode for one thing and one thing only: to balance a hash table. If you need a hash code for some other purpose, write your own hash code algorithm that is suitable for that purpose. – Eric Lippert Jan 12 '12 at 16:40
  • @EricLippert Thanks, I am already considering this as the only option. Thanks for the fast and accurate responses! – Ivaylo Slavov Jan 12 '12 at 16:42
  • 2
    @IvayloSlavov: To get a stable hashing for strings use methods and classes of the namespace: [System.Security.Cryptography](http://msdn.microsoft.com/it-it/library/system.security.cryptography%28v=VS.90%29.aspx). For example you could use [MD5](http://msdn.microsoft.com/it-it/library/system.security.cryptography.md5cryptoserviceprovider(v=VS.90).aspx) – digEmAll Jan 12 '12 at 16:42
  • 1
    @IvayloSlavov I've done similar to this with hashes that were under my control and it's worked well. You do need to account for the fact that unless you can create a perfect hash (google "perfect hash" if you aren't familiar with the concept) then there is **always** a risk of collision that you need to handle. That said, if there's a string token identifying the service, why not just pass that between machines? Even if hashing is used internal to the server, it doesn't have escape that context. – Jon Hanna Jan 12 '12 at 20:44

2 Answers2

16

Short answer: Yes.

But short answers are no fun, are they?

When you are implementing GetHashCode() you have to make the following guarantee:

When GetHashCode() is called on another object that should be considered equal to this, in this App Domain, the same value will be returned.

That's it. There's some things you really need to try to do (spread the bits around with non-equal objects as much as possible, but don't take so long about it that it outweighs all the benefits of hashing in the first place) and your code will suck if you don't do so, but it won't actually break. It will break if you don't go that far, because then e.g.:

dict[myObj] = 3;
int x = dict[myObj];//KeyNotFoundException

Okay. If I'm implementing GetHashCode(), why might I go further than that, and why might I not?

First, why might I not?

Maybe it's a slightly different version of the assembly and I improved (or at least attempted to) in between builds.

Maybe one is 32-bit and one is 64-bit and I was going nuts for efficiency and chose a different algorithm for each to make use of the different word sizes (this is not unheard of, especially when hashing objects like collections or strings).

Maybe some element I'm deciding to consider in deciding on what constitutes "equal" objects is itself varying from system to system in this sort of way.

Maybe I actually deliberately introduce a different seed with different builds to catch any case where a colleague is mistakenly depending upon my hash code! (I've heard MS do this with their implementation for string.GetHashCode(), but can't remember whether I heard that from a credible or credulous source).

Mainly though, it'll be one of the first two reasons.

Now, why might I give such a guarantee?

Most likely if I do, it'll be by chance. If an element can be compared for equality on the basis of a single integer id alone, then that's what I'm going to use as my hash-code. Anything else will be more work for a less good hash. I'm not likely to change this, so I might.

The other reason why I might, is that I want that guarantee myself. There's nothing to say I can't provide it, just that I don't have to.


Okay, let's get to something practical. There are cases where you may want a machine-independent guarantee. There are cases where you may want the opposite, which I'll come to in a bit.

First, check your logic. Can you handle collisions? Good, then we'll begin.

If it's your own class, then implement so as to provide such a guarantee, document it, and you're done.

If it's not your class, then implement IEqualityComparer<T> in such a way as to provide it. For example:

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    int hash = obj.Length;
    for(int i = 0; i != obj.Length; ++i)
      hash = (hash << 5) - hash + obj[i];
    return hash;
  }
}

Then use this instead of the built-in hash-code.

There's an interesting case where we may want the opposite. If I can control the set of strings you are hashing, then I can pick a bunch of strings with the same hash-code. Your hash-based collection's performance will hit the worse-case and be pretty atrocious. Chances are I can keep doing this faster than you can deal with it, so it can be a denial of service attack. There's not many cases where this happens, but an important one is if you're handling XML documents I send and you can't just rule out some elements (a lot of formats allow for freedom of elements within them). Then the NameTable inside your parser will be hurt. In this case we create a new hash mechanism each time:

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    int hash = hashSeed + obj.Length;
    for(int i = 0; i != obj.Length; ++i)
      hash = hash << 5 - hash + obj[i];
    hash += (hash <<  15) ^ 0xffffcd7d;
    hash ^= (hash >>> 10);
    hash += (hash <<   3);
    hash ^= (hash >>>  6);
    hash += (hash <<   2) + (hash << 14);
    return hash ^ (hash >>> 16)
  }
}

This will be consistent within a given use, but not consistent from use to use, so an attacker can't construct input to force it to be DoSsed. Incidentally, NameTable doesn't use an IEqualityComparer<T> because it wants to deal with char-arrays with indices and lengths without constructing a string unless necessary, but it does do something similar.

Incidentally, in Java the hash-code for string is specified and won't change, but this may not be the case for other classes.

Edit: Having done some research into the overall quality of the approach taken in ConsistentGuaranteedComparer above, I'm no longer happy with having such algorithms in my answers; while it serves to describe the concept, it doesn't have as good a distribution as one might like. Of course, if one has already implemented such a thing, then one can't change it without breaking the guarantee, but if I'd now recommend using this library of mine, written after said research as follows:

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32();
  }
}

That for RandomComparer above isn't as bad, but can also be improved:

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32(hashSeed);
  }
}

Or for even harder predictability:

public class RandomComparer : IEqualityComparer<string>
{
  private long seed0 = Environment.TickCount;
  private long seed1 = DateTime.Now.Ticks;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash128(seed0, seed1).GetHashCode();
  }
}
Jon Hanna
  • 110,372
  • 10
  • 146
  • 251
  • 2
    This goes far beyond what I've asked. I am very glad to have this info and really appreciate your effort. Thank you – Ivaylo Slavov Jan 12 '12 at 18:30
  • 1
    As I said, short answers are no fun :) – Jon Hanna Jan 12 '12 at 20:32
  • 1
    I was looking for information about NameTable's implementation of hashing - wondering why it was different than GetHashCode() and this answer covered that. Nice work! – Clay Dec 22 '15 at 14:31
  • 1
    *"I've heard MS do this with their implementation for `string.GetHashCode()`..."* Now that the .NET source is available you [now have a authoritative source](http://referencesource.microsoft.com/mscorlib/system/string.cs.html#0a17bbac4851d0d4) that shows that they do actually use random haschodes in some builds if the `FEATURE_RANDOMIZED_STRING_HASHING` build variable is set. Also, if it is a `DEBUG` build they also do `hash1 ^= ThisAssembly.DailyBuildNumber;` to make sure no one is doing anything stupid like trying to persist hash values, – Scott Chamberlain Sep 04 '16 at 22:39
1

It will produce different result even on the same machine on different runs.

So it basically can be used (and it is actually used) to check something during the current run of the program, but there is no sence to store it, to check something against it after. Cause the number you get is generated by runtime.

EDIT

For specific case of a string it will produce the same result even on different machines, except the case when machines have different architecture.

Tigran
  • 61,654
  • 8
  • 86
  • 123
  • Can you clarify more. Please, take into account the revisiting of my question - I have clarified that I ask for non-inheritable simple or primitive types only. Would you be able to share some links for more information too? Thanks in advance – Ivaylo Slavov Jan 12 '12 at 16:08
  • 3
    That's not true. For strings you will get different value only if you change platform (x86 vs x64) otherwise GetHashValue returns always the same value. – digEmAll Jan 12 '12 at 16:12
  • Still wrong. Same string, same framework version, same result. It's not a rule that this would have to be the case though. – Jon Hanna Jan 12 '12 at 17:00
  • Indeed, as written this would be broken. If a constant string produced a consistent hash-code and a non-constant didn't, then a constant string would have a different hash-code to an equivalent non-constant, which would not be allowed. – Jon Hanna Jan 12 '12 at 17:23
  • @JonHanna: may be here I used "constant" word in inappropriate way. The strings are immutable in C#, so they are all "constant". That was the meaning. In other words: string "Hello" will produce the same `hashcode` on my machine and on yours, if we share the same processor architecture. – Tigran Jan 12 '12 at 18:17