Why GetHashCode method needs to do shift in C#

Question

According to MSDN GetHashCode Method:

public struct Point
{
    private int x;
    private int y;

    public Point(int x, int y)
    {
       this.x = x;
       this.y = y;
    }

    public override bool Equals(Object obj)
    {
       if (!(obj is Point)) return false;

       Point p = (Point) obj;
       return x == p.x & y == p.y;
    }

    public override int GetHashCode()
    { 
        return ShiftAndWrap(x.GetHashCode(), 2) ^ y.GetHashCode();
    } 

    private int ShiftAndWrap(int value, int positions)
    {
        positions = positions & 0x1F;

        // Save the existing bit pattern, but interpret it as an unsigned integer.
        uint number = BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
        // Preserve the bits to be discarded.
        uint wrapped = number >> (32 - positions);
        // Shift and wrap the discarded bits.
        return BitConverter.ToInt32(BitConverter.GetBytes((number << positions) | wrapped), 0);
    }
}

I'm confused about the ShiftAndWrap Method, I know that is used to avoid generating collision hashcode. But I have questions as follows:

Why the parameter positions is set as 2?
Why the method do right-shift (32-positions) first then do left-shift positons, Does it have specific meaning?
As mentioned above, this method is used to reduce the situation of having collision, e.g. new Point(5,8) vs new Point(8,5), but if I create an object like new Point(3,16), it will get the same hashcode as new Point(5,8) did, so... what's the real effect of this method?

I personally cannot see the necessity of complex and hardly understandable mathematics for hashcode computation in this case. To solve your problem, why not use one of the simpler algorithms on msdn using a tuple? If results of a hash code function are not unique, this function makes no sense for hash code computation. — Michael Würthner, Jan 09 '17 at 06:31
@MichaelWürthner Exactly, the Tuple object is more easily, but it may significantly impact the overall performance of an application when stores large numbers of objects in hash tables. — Allen Chen, Jan 09 '17 at 06:55
@陳品翰 so the one way is to do the xor and ignore the bits after 32 bits, I someone go with that way then a value with more than 32 bit and less than 32 bit can collide, but instead of discarding the bit if we wrap those then the collision can be minimize. — Ranjit Singh, Jan 09 '17 at 07:07
It doesn't reduce collisions (it can't - there's still 64 bits of input and 32 bits of output). What it does do is avoid having the collisions all being symmetrical about the line *y = x*. — Damien_The_Unbeliever, Jan 09 '17 at 07:55
Short of asking the author, it’s difficult to tell *why* this was done when the code does not contain any hints. For the reasons you’ve discovered yourself, it doesn’t seem to be the best idea though and obviously also isn’t simple to understand on its own, so you’re better off using a different technique for computing a hash value. — poke, Jan 09 '17 at 08:06
@Damien_The_Unbeliever yeah... I think it won't reduce collision too.. So this algorithm is just avoid the line symmetrical like new Point(5,8) and new Point(8,5) ?? why should we avoid symmetrical? the Equals method will identify them, won't it? — Allen Chen, Jan 09 '17 at 08:18

JLRishe · Accepted Answer · 2017-02-07T20:05:59.303

I couldn't say why they chose this particular hash code implementation, but with regard to this question:

Why the method do right-shift (32-positions) first then do left-shift positons, Does it have specific meaning?

The ShiftAndWrap() method here is a generic implementation of an algorithm for left-shifting a value by N bits and wrapping the overflow back to the end. So before they do the shift, they first get the left-most N bits so they can then append those onto the end.

So here's what calling ShiftAndWrap() would look like if we were just working with 8-bit values (bytes) and called it with value = (binary) 11010010 and positions = 3:

value = 11010010

positions = 3

wrapped = value >> (8 - positions)
        = 11010010 >> (8 - 3) 
        = 11010010 >> 5 
        = 00000110

result = value << positions | wrapped
       = 11010010 << 3 | 00000110 
       = 10010000 | 00000110 
       = 10010110

We can see that the return value 10010110 is the result of shifting 11010010 by three bits and wrapping around the result.

As to the question of why they don't just use x ^ y, I suspect this is because this would mean that Point(N, M) would always produce the same hash code as Point(M, N). By doing a shift on the x value, we can have a hash code that not only takes into account the x and y values, but also their order, whereas x ^ y would ignore their order.

When doing hashing on a data structure that contains sub-components of the same type, it's common to have the hash function treat each of the sub-components differently so that their position matters. For example, Java uses this hash formula for strings (here ^ denotes an exponent, not XOR):

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

We can see that each character is multiplied by a different power of 31, so that stop has a different hash code from pots.

As for why they chose 2 as the number of positions to shift, that might be arbitrary, or they may have done some evaluations to see what degree of shifting would be likely to produce the best distribution.

ohhhh! I get it!! ur explanation is clear. thank you very much!! — Allen Chen, Jan 09 '17 at 07:23

score 0 · Answer 2 · answered Jan 09 '17 at 06:28

The point of the HashCodeis to create a distribution so that data structures can allocate the data into certain buckets. Its not meant for equality.

If you look at the internals for the HashSet you can see that the class uses the HashCode to identify the correct bucket and then uses the Equals method to determine equality.

    /// <summary>
    /// Checks if this hashset contains the item
    /// </summary>
    /// <param name="item">item to check for containment</param>
    /// <returns>true if item contained; false if not</returns>
    public bool Contains(T item) {
        if (m_buckets != null) {
            int hashCode = InternalGetHashCode(item);
            // see note at "HashSet" level describing why "- 1" appears in for loop
            for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
                if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                    return true;
                }
            }
        }
        // either m_buckets is null or wasn't found
        return false;
    }

So collisions are fine, it just just there so ensure a relatively equal distribution to allow for faster identification and retrieval. Meaning that, in your case, both of those points will be placed in the same bucket, but their Equals method will be used to identify them.

yes! I know the hashcode is just make identify more faster, but what I can't understand is that the simple GetHashCode method is: public override int GetHashCode() { return x ^ y; } why it needs ShiftAndWrap? This can really minimize collision?? — Allen Chen, Jan 09 '17 at 06:45

Why GetHashCode method needs to do shift in C#

2 Answers2