90

Given that collections like System.Collections.Generic.HashSet<> accept null as a set member, one can ask what the hash code of null should be. It looks like the framework uses 0:

// nullable struct type
int? i = null;
i.GetHashCode();  // gives 0
EqualityComparer<int?>.Default.GetHashCode(i);  // gives 0

// class type
CultureInfo c = null;
EqualityComparer<CultureInfo>.Default.GetHashCode(c);  // gives 0

This can be (a little) problematic with nullable enums. If we define

enum Season
{
  Spring,
  Summer,
  Autumn,
  Winter,
}

then the Nullable<Season> (also called Season?) can take just five values, but two of them, namely null and Season.Spring, have the same hash code.

It is tempting to write a "better" equality comparer like this:

class NewNullEnumEqComp<T> : EqualityComparer<T?> where T : struct
{
  public override bool Equals(T? x, T? y)
  {
    return Default.Equals(x, y);
  }
  public override int GetHashCode(T? x)
  {
    return x.HasValue ? Default.GetHashCode(x) : -1;
  }
}

But is there any reason why the hash code of null should be 0?

EDIT/ADDITION:

Some people seem to think this is about overriding Object.GetHashCode(). It really is not, actually. (The authors of .NET did make an override of GetHashCode() in the Nullable<> struct which is relevant, though.) A user-written implementation of the parameterless GetHashCode() can never handle the situation where the object whose hash code we seek is null.

This is about implementing the abstract method EqualityComparer<T>.GetHashCode(T) or otherwise implementing the interface method IEqualityComparer<T>.GetHashCode(T). Now, while creating these links to MSDN, I see that it says there that these methods throw an ArgumentNullException if their sole argument is null. This must certainly be a mistake on MSDN? None of .NET's own implementations throw exceptions. Throwing in that case would effectively break any attempt to add null to a HashSet<>. Unless HashSet<> does something extraordinary when dealing with a null item (I will have to test that).

NEW EDIT/ADDITION:

Now I tried debugging. With HashSet<>, I can confirm that with the default equality comparer, the values Season.Spring and null will end in the same bucket. This can be determined by very carefully inspecting the private array members m_buckets and m_slots. Note that the indices are always, by design, offset by one.

The code I gave above does not, however, fix this. As it turns out, HashSet<> will never even ask the equality comparer when the value is null. This is from the source code of HashSet<>:

    // Workaround Comparers that throw ArgumentNullException for GetHashCode(null).
    private int InternalGetHashCode(T item) {
        if (item == null) { 
            return 0;
        } 
        return m_comparer.GetHashCode(item) & Lower31BitMask; 
    }

This means that, at least for HashSet<>, it is not even possible to change the hash of null. Instead, a solution is to change the hash of all the other values, like this:

class NewerNullEnumEqComp<T> : EqualityComparer<T?> where T : struct
{
  public override bool Equals(T? x, T? y)
  {
    return Default.Equals(x, y);
  }
  public override int GetHashCode(T? x)
  {
    return x.HasValue ? 1 + Default.GetHashCode(x) : /* not seen by HashSet: */ 0;
  }
}
Jeppe Stig Nielsen
  • 60,409
  • 11
  • 110
  • 181
  • possible duplicate of [What should GetHashCode return when object's identifier is null?](http://stackoverflow.com/questions/5078149/what-should-gethashcode-return-when-objects-identifier-is-null) – Adriano Repetti May 23 '12 at 15:47
  • 26
    Why should the hash code for null not be zero? A hash collision is not the end of the world, you know. – Hot Licks May 23 '12 at 15:50
  • 3
    Except that it's a well known, quite common, collision. Not that it's *bad* or even that major of a problem, it's just easily avoidable – Chris Pfohl May 23 '12 at 15:52
  • 8
    lol why am I thinking "if the .NET framework jumps off a bridge, would you follow it?"... – Adam Houldsworth May 23 '12 at 15:55
  • 3
    Just out of curiosity, what would a null season be? – SwDevMan81 May 23 '12 at 15:55
  • 1
    Personally, this is why I always give my enums a value of "Empty" or "Unknown" as the first value. In this way, my Season enum would never represent Spring without me explicitly setting the value, thus nullifying the issue. – Chris May 23 '12 at 16:01
  • @Adam, how high's the bridge? ;) – Chris Pfohl May 23 '12 at 16:11
  • 1
    I know that hash collisions are sometimes unavoidable, and I know that a hash collision will not be the end of the universe, as such. But any well-written enum type **must** specify a name for the constant `0` of that type. Otherwise the default value `default(T)` of the **non-nullable** enum would be without name, and that is considered bad enum design. So this means that every nullable enum `T?`, no matter how "small" that enum is, must lead to a hash code collision. Isn't that unfortunate? – Jeppe Stig Nielsen May 23 '12 at 16:15
  • 1
    @JeppeStigNielsen It's less unfortunate in that enums tend to not be *that* large, and are also infrequently keys in a hash table. Even if they are hashed, because there are so few in total even the worst case ( O(n) searching as everything goes to the same bucket ) is not likely to noticeably affect the program. – Servy May 23 '12 at 16:42
  • I’d love to see a benchmark analysing how much impact this actually has. I could imagine some cases where it’s actually relevant. But in those you’d probably use a custom dictionary implementation using direct array access via the enum’s values. – Konrad Rudolph May 23 '12 at 20:32
  • 1
    What number you like to get for the null ? – Aristos May 24 '12 at 09:13
  • 1
    The basic point is that the specific hashcode for null, and the fact that it may end up "colliding" with a element in an enumeration, is so vanishingly trivial a concern that it's not worth the electronic paper we're spending on it. It's a non-issue -- zero performance problem (or as close to zero as one could get). – Hot Licks May 24 '12 at 15:35
  • 1
    To the first part, we also note that, unsurprisingly, `System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(null)` also gives `0`. – Jeppe Stig Nielsen May 04 '13 at 16:48
  • This question has been an eye opener for me.. I will never return 0 for null hash codes anymore. Considering 0 is a pretty common hash for int 0, bool false, nulls, enum default value etc. Performance performance! !! I'm evil anyway... – nawfal Jan 13 '14 at 06:24
  • However, I see, with `string`, that while `EqualityComparer.Default.GetHashCode(null)` uses zero as the hash for the null string, things like `StringComparer.Ordinal.GetHashCode(null)` throw `ArgumentNullException`. So that is a special case. – Jeppe Stig Nielsen Jan 23 '15 at 16:07
  • @JeppeStigNielsen: Collisions on the zero hash are a non-issue in a chain-bucket hash table, especially when comparisons involving one of the items will be cheap anyway. A bigger issue with zero hashes would be with code that tries to cache hash values but can't cache zero. For that reason, I'd suggest that it should be considered fine for hash functions to return zero when they can do so quickly, but slower hash functions should try to avoid it. – supercat Feb 19 '15 at 04:06

9 Answers9

27

So long as the hash code returned for nulls is consistent for the type, you should be fine. The only requirement for a hash code is that two objects that are considered equal share the same hash code.

Returning 0 or -1 for null, so long as you choose one and return it all the time, will work. Obviously, non-null hash codes should not return whatever value you use for null.

Similar questions:

GetHashCode on null fields?

What should GetHashCode return when object's identifier is null?

The "Remarks" of this MSDN entry goes into more detail around the hash code. Poignantly, the documentation does not provide any coverage or discussion of null values at all - not even in the community content.

To address your issue with the enum, either re-implement the hash code to return non-zero, add a default "unknown" enum entry equivalent to null, or simply don't use nullable enums.

Interesting find, by the way.

Another problem I see with this generally is that the hash code cannot represent a 4 byte or larger type that is nullable without at least one collision (more as the type size increases). For example, the hash code of an int is just the int, so it uses the full int range. What value in that range do you choose for null? Whatever one you pick will collide with the value's hash code itself.

Collisions in and of themselves are not necessarily a problem, but you need to know they are there. Hash codes are only used in some circumstances. As stated in the docs on MSDN, hash codes are not guaranteed to return different values for different objects so shouldn't be expected to.

Community
  • 1
  • 1
Adam Houldsworth
  • 63,413
  • 11
  • 150
  • 187
  • I don't think the questions you link are completely similar. When you are overriding `Object.GetHashCode()` in your own class (or struct), you know that this code will only be hit when people actually have an **instance** of your class. That instance can't be `null`. That's why you don't start your override of `Object.GetHashCode()` with `if (this == null) return -1;` There's a difference between "being `null`" and "being an object possessing some fields that are `null`". – Jeppe Stig Nielsen May 23 '12 at 21:09
  • You say: _Obviously, non-null hash codes should not return whatever value you use for null._ That would be ideal, I agree. And that is the reason why I asked my question in the first place, because whenever we write an enum `T`, then `(T?)null` and `(T?)default(T)` will have the same hash code (in the current implementation of .NET). That could be changed if the implementors of .NET changed either the hash code of `null` **or** the hash code algorithm of the `System.Enum`. – Jeppe Stig Nielsen May 23 '12 at 21:17
  • I agree the links were for null internal fields. You mention it is for IEqualityComparer, in your implementation the hash code is still specific to a type so you are still in the same situation, consistency for the type. Returning the same hash code for nulls of any type won't matter as nulls don't have a type. – Adam Houldsworth May 24 '12 at 06:10
  • 1
    Note: I updated my question twice. It turns out that (at least with `HashSet<>`) it does not work to change the hash code of `null`. – Jeppe Stig Nielsen May 26 '12 at 09:18
6

It doesn't have to be zero -- you could make it 42 if you wanted to.

All that matters is consistency during the execution of the program.

It's just the most obvious representation, because null is often represented as a zero internally. Which means, while debugging, if you see a hash code of zero, it might prompt you to think, "Hmm.. was this a null reference issue?"

Note that if you use a number like 0xDEADBEEF, then someone could say you're using a magic number... and you kind of would be. (You could say zero is a magic number too, and you'd be kind of right... except that it's so widely used as to be somewhat of an exception to the rule.)

user541686
  • 205,094
  • 128
  • 528
  • 886
6

Bear in mind that the hash code is used as a first-step in determining equality only, and [is/should]never (be) used as a de-facto determination as to whether two objects are equal.

If two objects' hash codes are not equal then they are treated as not equal (because we assume that the unerlying implementation is correct - i.e. we don't second-guess that). If they have the same hash code, then they should then be checked for actual equality which, in your case, the null and the enum value will fail.

As a result - using zero is as good as any other value in the general case.

Sure, there will be situations, like your enum, where this zero is shared with a real value's hash code. The question is whether, for you, the miniscule overhead of an additional comparison causes problems.

If so, then define your own comparer for the case of the nullable for your particular type, and ensure that a null value always yields a hash code that is always the same (of course!) and a value that cannot be yielded by the underlying type's own hash code algorithm. For your own types, this is do-able. For others - good luck :)

Andras Zoltan
  • 41,961
  • 13
  • 104
  • 160
4

Good question.

I just tried to code this:

enum Season
{
  Spring,
  Summer,
  Autumn,
  Winter,
}

and execute this like this:

Season? v = null;
Console.WriteLine(v);

it returns null

if I do, instead normal

Season? v = Season.Spring;
Console.WriteLine((int)v);

it return 0, as expected, or simple Spring if we avoid casting to int.

So.. if you do the following:

Season? v = Season.Spring;  
Season? vnull = null;   
if(vnull == v) // never TRUE

EDIT

From MSDN

If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values

In other words: if two objects have same hash code that doesn't mean that they are equal, cause real equality is determined by Equals.

From MSDN again:

The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

Tigran
  • 61,654
  • 8
  • 86
  • 123
  • 6
    a collision, by definition, means two unequal objects have the same hashcode. You have demonstrated that the objects are not equal. Now do they have the same hash code? According to the OP they do, meaning this is a collision. Now, it's not the end of the world to have a collision, it's simply a more likely collision than if null hashed to something other than 0, which hurts performance. – Servy May 23 '12 at 15:55
  • 1
    So what does your answer actually say? You say that Season.Spring is not equal to null. Well, that's not wrong, but it doesn't really answer the question in any way now does it. – Servy May 23 '12 at 16:00
  • 2
    @Servy: the question says: that why I have *same* hascode for 2 different object (*null* and *Spring*). So the answer is that there is not *collision* cause even having the same hashcode, they are not equal, by the way. – Tigran May 23 '12 at 16:03
  • Yes, it is a collision. A collision, by definition, is two objects with the same hashcode that are not equal. As I said before, it's not the end of the world, but because this is a rather likely collision it can harm performance. It doesn't break everything entirely, but it's not good either. The OP is asking if this is a significant problem, if his solution solves it, or if there is a better solution. Your "answer" answers none of those questions, it only really restates information either stated in the question or that the questioner took for granted. – Servy May 23 '12 at 16:30
  • @Servy: don't ser **any** note about performance in the question and the most voted answer stands **exactly** what I do. Last phrade is: "But is there any reason why the hash code of null should be 0? " – Tigran May 23 '12 at 17:14
  • And what is your answer to that question? Is it yes, is it no? You are correct, the OP doesn't mention performance at all. He asks, "what are the reasons to ...". The reason that he considers going out of his way to avoid null hashing to zero is to avoid collisions, and the effect of avoiding collisions is to improve performance, therefore the answer to the question (as opposed to just a restating of it) should discuss performance. – Servy May 23 '12 at 17:41
  • @Servy: last comment promise:) the question is: why null.GetHashCode() ==0. Answer: why not ? Cause the fact that null.GetHashCode() == Sptring.GetHashCode() doesn't mean the values are equal and there *should not* be any confusion. Questions: 1. Does it significally harm performance of my application, or better, *may* it significally harm may performance? Answer: it depends how you use it, and should be measure. 2. Why MS implemented in that way? Have no idea. – Tigran May 23 '12 at 17:46
  • 3
    "Answer: why not ?" Well, the OP pre-emptively answered your question of "why not". It is more likely to cause collisions than another number. He was wondering if there was a reason 0 was chosen, and nobody has so far answered that. – Servy May 23 '12 at 17:56
  • 1
    This answer contains nothing that the OP doesn’t already know, evident from the way the question was asked. – Konrad Rudolph May 23 '12 at 20:29
  • @Servy -- How is having to step through possibly one more link in a hash table chain going to create performance problems? It's bad when one bucket has 5-10 times more entries than the average bucket (and the average bucket has 5-10), but the time required to visit a single additional link is vanishingly small. (And, in the case of four values and the null, it would almost certainly be faster to just do a linear search than to use a hashtable in the first place.) – Hot Licks May 23 '12 at 23:32
  • @KonradRudolph: there is none (except framework designer) to answer fully on this question. My answer: why not? It doens't create *any* problem. That is. I hoped this meaning is clear from my post.Or let's say in another way: don't worry, it doesn't create a problem. It could be done per default like 0, -1, IntPtr.Zero, IntPtr.MinValue.. whatever, but (for some reason) was choosen 0. – Tigran May 24 '12 at 05:38
  • @HotLicks I agree, and have already stated in this question that I don't believe the performance implications are significant enough to actually worry about, but that it clearly what the OP is interested in, so a proper answer should address the issue. – Servy May 24 '12 at 13:38
  • @Tigran Your answer was originally just wrong, which obviously causes problems. It has since been edited to not be factually false, but as Konrad says, it contains no information that isn't already in the OP, so it is not in any way helpful. An answer of "why not" should just be a comment (if anything). If you were to explain why it doesn't matter what they picked, and so picking 0 is just fine, then that would actually answer the question. You didn't say that. – Servy May 24 '12 at 13:42
  • @Servy: sorry, bu can not agree on that it was originally wrong. My edit just **add** a stuff and does not **change** anything. I *can* agree, by the way, on fact that I didn't clearly manifest my thoughts. – Tigran May 24 '12 at 13:45
  • @Tigran You originally said that there was no collision happening. That is factually false. There is indeed a collision happening. It is the textbook definition of a collision in fact. That factual inaccuracy has since been removed from the answer. There is a revision log, so I don't know why you'd dispute something that's recorded... – Servy May 24 '12 at 13:47
  • @Servy: ah ok, you are right. But I explained, hopefully, that it's actually how **I** interpreted "collision". – Tigran May 24 '12 at 13:50
  • @Tigran Considering you no longer use the word at all, no. I don't see how you have interpreted it. I know I defined it for you and you said I was "correct". I have not seen you redefine it since. – Servy May 24 '12 at 13:52
4

But is there any reason why the hash code of null should be 0?

It could have been anything at all. I tend to agree that 0 wasn't necessarily the best choice, but it's one that probably leads to fewest bugs.

A hash function absolutely must return the same hash for the same value. Once there exists a component that does this, this is really the only valid value for the hash of null. If there were a constant for this, like, hm, object.HashOfNull, then someone implementing an IEqualityComparer would have to know to use that value. If they don't think about it, the chance they'll use 0 is slightly higher than every other value, I reckon.

at least for HashSet<>, it is not even possible to change the hash of null

As mentioned above, I think it's completely impossible full stop, just because there exist types which already follow the convention that hash of null is 0.

Roman Starkov
  • 59,298
  • 38
  • 251
  • 324
  • When one implements the method `EqualityComparer.GetHashCode(T)` for some particular type `T` that allows `null`, one has to do **something** when the argument is `null`. You could (1) throw an `ArgumentNullException`, (2) return `0`, or (3) return something else. I take your answer for an recommendation always to return `0` in that situation? – Jeppe Stig Nielsen Jun 12 '12 at 15:15
  • @JeppeStigNielsen I'm not sure about throw vs return, but if you do choose to return, then definitely zero. – Roman Starkov Jun 12 '12 at 20:57
2

It is 0 for the sake of simplicity. There is no such hard requirement. You only need to ensure the general requirements of hash coding.

For example, you need to make sure that if two objects are equal, their hashcodes must always be equal too. Therefore, different hashcodes must always represent different objects (but it's not necessarily true vice versa: two different objects may have the same hashcode, even though if this happens often then this is not a good quality hash function -- it doesn't have a good collision resistance).

Of course, I restricted my answer to requirements of mathematical nature. There are .NET-specific, technical conditions as well, which you can read here. 0 for a null value is not among them.

Thomas Calc
  • 2,994
  • 3
  • 30
  • 56
1

So this could be avoided by using an Unknown enum value (although it seems a bit weird for a Season to be unknown). So something like this would negate this issue:

public enum Season
{
   Unknown = 0,
   Spring,
   Summer,
   Autumn,
   Winter
}

Season some_season = Season.Unknown;
int code = some_season.GetHashCode(); // 0
some_season = Season.Autumn;
code = some_season.GetHashCode(); // 3

Then you would have unique hash code values for each season.

SwDevMan81
  • 48,814
  • 22
  • 151
  • 184
1

Personally I find using nullable values a bit awkward and try to avoid them whenever I can. Your issue is just another reason. Sometimes they are very handy though but my rule of thumb is not to mix value types with null if possible simply because these are from two different worlds. In .NET framework they seem to do the same - a lot of value types provide TryParse method which is a way of separating values from no value (null).

In your particular case it is easy to get rid of the problem because you handle your own Season type.

(Season?)null to me means 'season is not specified' like when you have a webform where some fields are not required. In my opinion it is better to specify that special 'value' in the enum itself rather than use a bit clunky Nullable<T>. It will be faster (no boxing) easier to read (Season.NotSpecified vs null) and will solve your problem with hash codes.

Of course for other types, like int you can't expand value domain and to denominate one of the values as special is not always possible. But with int? hash code collision is much smaller problem, if at all.

Maciej
  • 7,871
  • 1
  • 31
  • 36
  • When you say "boxing", I think you mean "wrapping", i.e. putting a struct value inside a `Nullable<>` struct (where the `HasValue` member will be then set to `true`). Are you sure the problem is really smaller with `int?`? A lot of the time one uses only a few values of `int`, and then it's equivalent to an enum (which can in theory have many members). – Jeppe Stig Nielsen May 29 '12 at 19:48
  • Generally I'd say that enum is chosen when there is limited number of known values required (2-10). If limit is bigger or none, `int` makes more sense. Of course preferences vary. – Maciej May 29 '12 at 21:45
0
Tuple.Create( (object) null! ).GetHashCode() // 0
Tuple.Create( 0 ).GetHashCode() // 0
Tuple.Create( 1 ).GetHashCode() // 1
Tuple.Create( 2 ).GetHashCode() // 2
Denis535
  • 3,407
  • 4
  • 25
  • 36
  • 1
    That’s an interesting approach. It would be useful to edit your answer to include some additional explanation, and especially given the nature of the question. – Jeremy Caney Jun 07 '20 at 00:44