69

I would like to write a function GetHashCodeOfList() which returns a hash-code of a list of strings regardless of order. Given 2 lists with the same strings should return the same hash-code.

ArrayList list1 = new ArrayList()    
list1.Add("String1");
list1.Add("String2");
list1.Add("String3");    

ArrayList list2 = new ArrayList()    
list2.Add("String3");    
list2.Add("String2"); 
list2.Add("String1");

GetHashCodeOfList(list1) = GetHashCodeOfList(list2) //this should be equal.

I had a few thoughts:

  1. I can first sort the list, then combine the sorted list into 1 long string and then call GetHashCode(). However sorting is a slow operation.

  2. I can get the hash of each individual string (by calling string.GetHashCode()) in the list, then multiplying all hashes and calling Mod UInt32.MaxValue. For example: "String1".GetHashCode() * "String2".GetHashCode * … MOD UInt32.MaxValue. But this results in a number overflow.

Does anyone have any thoughts?

Thanks in advance for your help.

nawfal
  • 70,104
  • 56
  • 326
  • 368
MaxK
  • 3,511
  • 4
  • 20
  • 16
  • 4
    In your your example, the two hashes _could well_ be equal, because you are trying to assign the hash of list2 to the hash of list1. :P – ProfK Mar 22 '09 at 07:54

5 Answers5

83

There are various different approaches here the under two main categories, each typically with their own benefits and disadvantages, in terms of effectiveness and performance. It is probably best to choose the simplest algorithm for whatever application and only use the more complex variants if necessary for whatever situation.

Note that these examples use EqualityComparer<T>.Default since that will deal with null elements cleanly. You could do better than zero for null if desired. If T is constrained to struct it is also unnecessary. You can hoist the EqualityComparer<T>.Default lookup out of the function if so desired.

Commutative Operations

If you use operations on the hashcodes of the individual entries which are commutative then this will lead to the same end result regardless of order.

There are several obvious options on numbers:

XOR

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}

One downside of that is that the hash for { "x", "x" } is the same as the hash for { "y", "y" }. If that's not a problem for your situation though, it's probably the simplest solution.

Addition

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked (hash + 
            EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}

Overflow is fine here, hence the explicit unchecked context.

There are still some nasty cases (e.g. {1, -1} and {2, -2}, but it's more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).

Here is an example of such an algorithm that gets around the aforementioned problem in a fairly efficient manner. It also has the benefit of greatly increasing the distribution of the hash codes generated (see the article linked at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a large range of input values and plotting the results should verify it well enough.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    int curHash;
    int bitOffset = 0;
    // Stores number of occurences so far of each value.
    var valueCounts = new Dictionary<T, int>();

    foreach (T element in source)
    {
        curHash = EqualityComparer<T>.Default.GetHashCode(element);
        if (valueCounts.TryGetValue(element, out bitOffset))
            valueCounts[element] = bitOffset + 1;
        else
            valueCounts.Add(element, bitOffset);

        // The current hash code is shifted (with wrapping) one bit
        // further left on each successive recurrence of a certain
        // value to widen the distribution.
        // 37 is an arbitrary low prime number that helps the
        // algorithm to smooth out the distribution.
        hash = unchecked(hash + ((curHash << bitOffset) |
            (curHash >> (32 - bitOffset))) * 37);
    }

    return hash;
}

Multiplication

Which has few if benefits over addition: small numbers and a mix of positive and negative numbers they may lead to a better distribution of hash bits. As a negative to offset this "1" becomes a useless entry contributing nothing and any zero element results in a zero. You can special-case zero not to cause this major flaw.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked (hash * h);
    }
    return hash;
}

Order first

The other core approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself is immaterial so long as it is consistent.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any function/code you like returning int
        hash = f(hash, element);
    }
    return hash;
}

This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits for example) but this comes at significantly higher cost. The sort is O(n log n) and the required copy of the collection is a memory allocation you can't avoid given the desire to avoid modifying the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to that given in the last example under the Addition section (e.g. any constant number of bit shifts left followed by a multiplication by a prime - you could even use successive primes on each iteration at no extra cost, since they only need be generated once).

That said, if you were dealing with cases where you could calculate and cache the hash and amortize the cost over many calls to GetHashCode this approach may yield superior behaviour. Also the latter approach is even more flexible since it can avoid the need to use the GetHashCode on the elements if it knows their type and instead use per byte operations on them to yield even better hash distribution. Such an approach would likely be of use only in cases where the performance was identified as being a significant bottleneck.

Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts would be worthwhile reads, in particular the Implementing a simple hashing algorithm (pt II) post.

Bevan
  • 43,618
  • 10
  • 81
  • 133
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 1
    By accepting a little worse distribution you could use hash += element.GetHashCode() to get rid of (x,x) = (y,y). – Rauhotz Mar 21 '09 at 22:13
  • I was just thinking exactly the same thing, ironically. Will edit to give that as an alternative. – Jon Skeet Mar 21 '09 at 22:14
  • Right, so now our two solutions have pretty much converged. Adding the unchecked keyword to deal with overflows (i.e. in the case of long lists) would still be an improvement I'd think. If you want to add in that, I might as well then delete my post... – Noldorin Mar 21 '09 at 22:45
  • Righto, will do. Will also make this community wiki, as it's truly been a team effort :) – Jon Skeet Mar 21 '09 at 22:47
  • Ok, great. I think we have a pretty nice solution now. :) – Noldorin Mar 21 '09 at 23:08
  • I've edited the post regarding the latest "nasty case" you just pointed out - hope that's alright. – Noldorin Mar 21 '09 at 23:11
  • I added multiplication for completeness and showed how the commutative approach differs from the sort then do what you like approach. Hope no one minds. I liminated the references to who suggested what, it makes for a cleaner answer. – ShuggyCoUk Mar 22 '09 at 17:38
  • Rereading this http://msdn.microsoft.com/en-us/library/ms132155.aspx says GetHashcode will throw if the element is null. fixing this – ShuggyCoUk Mar 22 '09 at 17:49
  • @ShuggyCoUk: Note that we already resolved this problem by using EqualityComparer. I'll leave it to Jon whether he wants to rollback anything. – Noldorin Mar 22 '09 at 18:01
  • I think this answer is a prime example of what SO should be about. I don't think anyone will get any rep from it (although I admit I'd get any badges earned - not a big deal) but it's been a real team effort. The result is a pretty definitive response IMO. I wish more answers ended up like this. – Jon Skeet Mar 22 '09 at 20:15
  • Yeah, it really looks like a great answer now (especially with the improved layout and formatting!). This does indeed show how well the wiki aspect of SO can work... Oh, and curse you for getting any badges that come out this. :P – Noldorin Mar 22 '09 at 20:24
  • in wiki's this sort of thing gets Barnstars. I don't think we need them :) I learnt something from my edits... – ShuggyCoUk Mar 22 '09 at 20:45
  • I wonder if we should itegrate the core idea from guffa into the Ordering section. sorting the hashcodes is going to be much faster (and will work even on types that have no well defined order of their own) – ShuggyCoUk Mar 22 '09 at 20:48
  • @Shuggy: Faster than ordering the elements themselves, yes. Still O(n log n) though. I'll add a comment in Guffa's answer to suggest that he might want to join in. It would be rude just to copy. – Jon Skeet Mar 22 '09 at 20:58
  • Just added an improved algorithm to the Addition section, which I think overcomes one or two more problems. :) Let me know what you think (or just edit the post). – Noldorin Mar 22 '09 at 21:10
  • Doesn't that get rid of the whole nature of it being order-neutral? It doesn't seem to give the same results for { 1, 2 } as { 2, 1 } for example. – Jon Skeet Mar 22 '09 at 21:32
  • Yeah, that slipped my mind for some reason. I've gone an edited the code to provide what now seems to be an order-independent algorithm. It's moderately complex, so I'm not sure if it's the best option, though I suspect it's more efficient than the Order first algorithm. – Noldorin Mar 22 '09 at 21:59
  • I would suggest avoiding the prime related magic. All the best hash algorithms avoid it and propagating the idea can be counter productive. I suggest http://bretm.home.comcast.net/~bretm/hash/11.html as an alternate link for the hard core testing and explanation on hashes with a c# bias... – ShuggyCoUk Mar 22 '09 at 23:02
  • I would also say that the answer is straying from being as useful to people coming in with less advanced knowledge (inevitably the best answer for one group will not be as good for others) – ShuggyCoUk Mar 22 '09 at 23:05
  • If you would care to provide proof that algorithms involving prime numbers are actually counter-productive, then maybe. But I consider them a useful and simple way of increasing distribution... – Noldorin Mar 22 '09 at 23:07
  • also http://www.azillionmonkeys.com/qed/hash.html, http://www.cris.com/~ttwang/tech/inthash.htm and quite a few more. Hashing technique and theory has come a long way since knuth wrote ACP, hopefully he revises this in verson 4 – ShuggyCoUk Mar 22 '09 at 23:45
  • Fair enough... though if Knuth used it, whenever, I'm happy to do so too! Let's leave it how it is since as a simple method, it seems to be "good enough". If whoever is reading this wants to get more involved with the maths, they surely have the links with which to proceed. – Noldorin Mar 22 '09 at 23:51
  • On the multiply algorithm, am I right in saying that hash is always 0? – Dr Rob Lang Aug 20 '14 at 14:26
  • @RobLang: Yup, fixed. – Jon Skeet Aug 20 '14 at 15:30
24

An alternative to sorting the string lists would be to get the hash codes of the strings and then sort the hash codes. (Comparing ints is less expensive than comparing strings.) You can then use an algorithm to merge the hash codes that (hopefully) gives a better distribution.

Example:

GetHashCodeOfList<T>(IEnumerable<T> list) {
   List<int> codes = new List<int>();
   foreach (T item in list) {
      codes.Add(item.GetHashCode());
   }
   codes.Sort();
   int hash = 0;
   foreach (int code in codes) {
      unchecked {
         hash *= 251; // multiply by a prime number
         hash += code; // add next hash code
      }
   }
   return hash;
}
TeaDrivenDev
  • 6,591
  • 33
  • 50
Guffa
  • 687,336
  • 108
  • 737
  • 1,005
  • Guffa, do you feel like adding this into the accepted answer, which is turning into a fantastic wiki answer? It would be nice to incorporate this idea. – Jon Skeet Mar 22 '09 at 20:59
  • 2
    I don't know what the precise use case is here, but if this operation has to be performed several times, for example between adds, then it might be useful to calculate the hashcodes directly when adding new items? – Benjol Mar 23 '09 at 06:27
  • wouldn't GetHashCodeOfList(new int[0]) and GetHashCodeOfList(new int[]{0}) both give a hashcode of 0? Therefore int hash=0 might be better replaced with int hash = codes.Any()?7:0; – Brent Jul 14 '14 at 05:24
  • @Brent: Good point, but you can just start from a value other than zero to get different results for them. Anyhow, having different lists return the same hash code is not a problem, unless you have lists that all contain less than 32 bits of data, it's impossible to avoid it. – Guffa Jul 14 '14 at 14:04
0

A lot less code but maybe the performance isn't as good as the other answers:

public static int GetOrderIndependentHashCode<T>(this IEnumerable<T> source)    
    => source == null ? 0 : HashSet<T>.CreateSetComparer().GetHashCode(new HashSet<T>(source));
Matthew Kane
  • 321
  • 1
  • 7
  • The implementation of [`HashSetEqualityComparer.GetHashCode`](https://referencesource.microsoft.com/system.core/system/Collections/Generic/HashSetEqualityComparer.cs.html#9ee3901938563044) is actually a XOR, so this solution is it not only inefficient but also produces mediocre hashcodes. – Theodor Zoulias Apr 18 '19 at 14:47
0

Here is a hybrid approach. It combines the three commutative operations (XOR, addition and multiplication), applying each one in different ranges of the 32bit number. The bit-range of each operation is adjustable.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    var comparer = EqualityComparer<T>.Default;
    const int XOR_BITS = 10;
    const int ADD_BITS = 11;
    const int MUL_BITS = 11;
    Debug.Assert(XOR_BITS + ADD_BITS + MUL_BITS == 32);
    int xor_total = 0;
    int add_total = 0;
    int mul_total = 17;
    unchecked
    {
        foreach (T element in source)
        {
            var hashcode = comparer.GetHashCode(element);
            int xor_part = hashcode >> (32 - XOR_BITS);
            int add_part = hashcode << XOR_BITS >> (32 - ADD_BITS);
            int mul_part = hashcode << (32 - MUL_BITS) >> (32 - MUL_BITS);
            xor_total = xor_total ^ xor_part;
            add_total = add_total + add_part;
            if (mul_part != 0) mul_total = mul_total * mul_part;
        }
        xor_total = xor_total % (1 << XOR_BITS); // Compact
        add_total = add_total % (1 << ADD_BITS); // Compact
        mul_total = mul_total - 17; // Subtract initial value
        mul_total = mul_total % (1 << MUL_BITS); // Compact
        int result = (xor_total << (32 - XOR_BITS)) + (add_total << XOR_BITS) + mul_total;
        return result;
    }
}

The performance is almost identical with the simple XOR method, because the call to GetHashCode of each element dominates the CPU demand.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
0
    Dim list1 As ArrayList = New ArrayList()
    list1.Add("0")
    list1.Add("String1")
    list1.Add("String2")
    list1.Add("String3")
    list1.Add("abcdefghijklmnopqrstuvwxyz")

    Dim list2 As ArrayList = New ArrayList()
    list2.Add("0")
    list2.Add("String3")
    list2.Add("abcdefghijklmnopqrstuvwxyz")
    list2.Add("String2")
    list2.Add("String1")
    If GetHashCodeOfList(list1) = GetHashCodeOfList(list2) Then
        Stop
    Else
        Stop
    End If
    For x As Integer = list1.Count - 1 To 0 Step -1
        list1.RemoveAt(list1.Count - 1)
        list2.RemoveAt(list2.Count - 1)
        Debug.WriteLine(GetHashCodeOfList(list1).ToString)
        Debug.WriteLine(GetHashCodeOfList(list2).ToString)
        If list1.Count = 2 Then Stop
    Next


Private Function GetHashCodeOfList(ByVal aList As ArrayList) As UInt32
    Const mask As UInt16 = 32767, hashPrime As Integer = Integer.MaxValue
    Dim retval As UInt32
    Dim ch() As Char = New Char() {}
    For idx As Integer = 0 To aList.Count - 1
        ch = DirectCast(aList(idx), String).ToCharArray
        For idCH As Integer = 0 To ch.Length - 1
            retval = (retval And mask) + (Convert.ToUInt16(ch(idCH)) And mask)
        Next
    Next
    If retval > 0 Then retval = Convert.ToUInt32(hashPrime \ retval) 'Else ????
    Return retval
End Function
dbasnett
  • 11,334
  • 2
  • 25
  • 33