4

I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.

When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).

Here's an example demonstrating my problem:

Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());

Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());

Output:

Int32
43124074
51601393
String
406954194
406954194

As you can see, the arrays return a different hash.

Is there any way of getting the performance of the long array hash, but the uniqeness of the string hash?

See my own answer below for a performance comparison of all the suggestions.

About the potential duplicate -- that question has a lot of useful information, but as this question was primarily about finding high performance alternatives, I think it still provides some useful solutions that are not mentioned there.

Erlend D.
  • 3,013
  • 6
  • 37
  • 59
  • That's peculiar that your first example is not generating the same hash. Have you considered using tuples as keys instead? – Maximilian Burszley Aug 26 '19 at 14:27
  • 2
    Possible duplicate of [GetHashCode() on byte\[\] array](https://stackoverflow.com/questions/7244699/gethashcode-on-byte-array) – gmiley Aug 26 '19 at 14:30
  • @TheIncorrigible1 Not really taht wierd. Hope was (and it was confirmed here) that they use something like the Reference to generate the hash. Wich does avoid similar array to be considered the same. The 2nd case is just generating a hashcode from teh string "," and the return of "ToString()" (wich should be the type). – Christopher Aug 26 '19 at 14:31
  • 1
    Why would it generate the same hash? It doesn't hash the content of the array. – Haytam Aug 26 '19 at 14:31
  • @Haytam Well, looks like I had a gap in knowledge there. – Maximilian Burszley Aug 26 '19 at 14:32
  • @Haytam: Not an unreasonable expectation that it did, if you ask me ... – 500 - Internal Server Error Aug 26 '19 at 14:34
  • [See also](https://stackoverflow.com/a/3609953/4137916). Note that `Dictionary` has a constructor that accepts a custom `IEqualityComparer`. – Jeroen Mostert Aug 26 '19 at 14:34
  • From @gmiley's link, try `Console.WriteLine(((IStructuralEquatable)new[] { 1, 2, 3, 0 }).GetHashCode(EqualityComparer.Default));` – Sean Reid Aug 26 '19 at 14:35
  • @500-InternalServerError It is a mater of perspective. And my perspective is that too many implicit conversions and opeartions are a bad idea. I am just going to point at the JavaScript and PHP examples from this comics for samples of this: http://www.sandraandwoo.com/2015/12/24/0747-melodys-guide-to-programming-languages/ | Ever since PHP, I love strong typisation and few implicit conversions. – Christopher Aug 26 '19 at 14:39
  • Your strings are returning the same hash codes for the same strings correctly because `string.GetHashCode()` is implemented that way. The implementation of `int[].GetHashCode()` does something with its memory address to return the hash code, so arrays with identical contents will nevertheless return different hash codes. – Matthew Watson Aug 26 '19 at 14:40
  • Your key isn't of the `int[]` type. Well, that's how you may wish to implement it, but it should not be exposed - it's an implementation detail and it's up to you to actually 1. Make an application-specific type, and 2. Implement a hash for it. Problem solved :) – Kuba hasn't forgotten Monica Aug 26 '19 at 15:09
  • See the IEqualityComparer : https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.iequalitycomparer-1?view=netframework-4.8. It contains a hash and compare method so if the hash collides the compare will detect non matching items. – jdweng Aug 26 '19 at 15:42

5 Answers5

3

That the first one is different is actually good. Arrays are a reference type and luckily they are using the reference (somehow) during hash generation. I would guess that is something like the Pointer that is used on machine code level, or some Garbage Colletor level value. One of the things you have no influence on but is copied if you assign the same instance to a new reference variable.

In the 2nd case you get the hash value on a string consisting of "," and whatever (new[] { 1, 2, 3, 0 }).ToString(); should return. The default is something like teh class name, so of course in both cases they will be the same. And of course string has all those funny special rules like "compares like a value type" and "string interning", so the hash should be the same.

Christopher
  • 9,634
  • 2
  • 17
  • 31
3

Your strings are returning the same hash codes for the same strings correctly because string.GetHashCode() is implemented that way.

The implementation of int[].GetHashCode() does something with its memory address to return the hash code, so arrays with identical contents will nevertheless return different hash codes.

So that's why your arrays with identical contents are returning different hash codes.

Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.

The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).

Fortunately you can cache the hash code so it's only computed once.

Another major problem with using a mutable array for a hash code is that if you change the contents of the array after using it for the key of a hashing container such as Dictionary, you will break the container.

Ideally you would only use this kind of hashing for arrays that are never changed.

Bearing all that in mind, a simple wrapper would look like this:

public sealed class IntArrayKey
{
    public IntArrayKey(int[] array)
    {
        Array     = array;
        _hashCode = hashCode();
    }

    public int[] Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    int hashCode()
    {
        int result = 17;

        unchecked
        {
            foreach (var i in Array)
            {
                result = result * 23 + i;
            }
        }

        return result;
    }

    readonly int _hashCode;
}

You can use that in place of the actual arrays for more sensible hash code generation.


As per the comments below, here's a version of the class that:

  • Makes a defensive copy of the array so that it cannot be modified.
  • Implements equality operators.
  • Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.

Code:

public sealed class IntArrayKey: IEquatable<IntArrayKey>
{
    public IntArrayKey(IEnumerable<int> sequence)
    {
        _array    = sequence.ToArray();
        _hashCode = hashCode();

        Array = new ReadOnlyCollection<int>(_array);
    }

    public bool Equals(IntArrayKey other)
    {
        if (other is null)
            return false;

        if (ReferenceEquals(this, other))
            return true;

        return _hashCode == other._hashCode && equals(other.Array);
    }

    public override bool Equals(object obj)
    {
        return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
    }

    public static bool operator == (IntArrayKey left, IntArrayKey right)
    {
        return Equals(left, right);
    }

    public static bool operator != (IntArrayKey left, IntArrayKey right)
    {
        return !Equals(left, right);
    }

    public IReadOnlyList<int> Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    bool equals(IReadOnlyList<int> other) // other cannot be null.
    {
        if (_array.Length != other.Count)
            return false;

        for (int i = 0; i < _array.Length; ++i)
            if (_array[i] != other[i])
                return false;

        return true;
    }

    int hashCode()
    {
        int result = 17;

        unchecked
        {
            foreach (var i in _array)
            {
                result = result * 23 + i;
            }
        }

        return result;
    }

    readonly int   _hashCode;
    readonly int[] _array;
}

If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:

public IntArrayKey(int[] array)
{
    _array    = array;
    _hashCode = hashCode();

    Array = new ReadOnlyCollection<int>(_array);
}
Matthew Watson
  • 104,400
  • 10
  • 158
  • 276
  • Thanks, this looks like a flexible solution. How are 17 and 23 chosen? Is there any reason to change them? – Erlend D. Aug 26 '19 at 15:30
  • You might also want to override `Equals(object)` and/or implement `IEquatable`. Otherwise, you can add the array `{0, 1, 2, 3}` to the `Dictionary` multiple times. You might also need to clone the `int[]` passed to the constructor. If I create an `int[] arr`, then create an instance of `IntArrayKey` and pass it `arr`, I can modify the `IntArrayKey`'s array through the `arr` reference... – Joshua Robinson Aug 26 '19 at 15:55
  • @JoshuaRobinson Yes, I would definitely add IEquatable and IEquatable, but whether to defensively copy the array or not depends on intended usage. It might not be necessary to keep a reference to the array at all, if the class is only being used for a key - and in that case, you could just manage calculating the hash code directly and storing it as an int key (if maximum performance or reduced memory allocations was needed). – Matthew Watson Aug 26 '19 at 16:55
  • @ErlendD. I just used the numbers that Resharper uses when it auto-generates GetHashCode() for you. Any two prime numbers over 13 should be fine. – Matthew Watson Aug 26 '19 at 17:11
  • Thanks! This worked quite well, but was not the fastest solution. See my answer for a comparison. – Erlend D. Aug 27 '19 at 06:39
3

Another alternative is to leverage the lesser known IEqualityComparer to implement your own hash and equality comparisons. There are some notes you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it'll introduce instability should the keys ever change, but it would certainly be more performant than using string joins.

public class ArrayKeyComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        return x == null || y == null 
            ? x == null && y == null 
            : x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        var seed = 0;

        if(obj != null)
            foreach (int i in obj)
                seed %= i.GetHashCode();

        return seed;
    }
}

Note that this still may not be as performant as a tuple, since it's still iterating the array rather than being able to take a more constant expression.

David
  • 10,458
  • 1
  • 28
  • 40
  • Thanks, I'll try that as well! I'll post the results of my performance testing tomorrow. – Erlend D. Aug 26 '19 at 16:34
  • Tested it. Very similar to Jon Skeet's suggestion in the other post, but this was much slower. See my answer for a comparison. – Erlend D. Aug 27 '19 at 06:39
1

If you know the length of the arrays you're using, you could use a Tuple.

Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());

Outputs

Tuple
1248
1248
David
  • 10,458
  • 1
  • 28
  • 40
  • Thank you! I'll have to test this, but unless there is a performance gain, I would prefer to keep it an array. – Erlend D. Aug 26 '19 at 15:30
1

I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.

The suggestions are as follows:

  • int[] as key (original attempt -- does not work at all, included as a benchmark)
  • string as key (original solution -- works, but slow)
  • Tuple as key (suggested by David)
  • ValueTuple as key (inspired by the Tuple)
  • Direct int[] hash as key
  • IntArrayKey (suggested by Matthew Watson)
  • int[] as key with Skeet's IEqualityComparer
  • int[] as key with David's IEqualityComparer

I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).

For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.

The results are as follows (ordered from best to worst):

Algorithm                Works?   Time usage
NonGenericSkeetEquality  YES            392 ms
SkeetEquality            YES            422 ms
ValueTuple               YES            521 ms
QuickIntArrayKey         YES            747 ms
IntArrayKey              YES            972 ms
Tuple                    YES          1 609 ms
string                   YES          2 291 ms
DavidEquality            YES      1 139 200 ms ***
int[]                    NO             336 ms
IntHash                  NO             386 ms

The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.

** I'm aware that this is not a completely fool proof solution, as I could theoretically get the expected number of collisions without it actually being the collisions I expected, but having run the test a lot of times, I'm fairly certain I don't.

*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.

Erlend D.
  • 3,013
  • 6
  • 37
  • 59
  • You have to beware of the equality approach, because GetHashCode() is documented to return the same value for objects with the same data - that means that code could assume that two objects are different if their hash codes are different, even if their actual data is identical. This could cause incorrect operation in some cases (but that depends on the actual code of course). – Matthew Watson Aug 27 '19 at 07:39
  • ..But if you are also using the GetHashCode() implementation, then I'm surprised Jon's code is faster, since it is effectively doing the same loop (but with the additional overhead of calling `elementComparer.GetHashCode()` for each element of the array rather than just using the `int` value as the hash code. I guess we'd need to see your actual test code to see why that is. – Matthew Watson Aug 27 '19 at 08:26
  • @MatthewWatson: The code is identical to Skeet's code in the other question, but here are all my tests: https://pastebin.com/st8zS7S2 Wouldn't the combination of GetHashCode() and IsEqual() that the IEqualityComparer uses guarantee correct operation in all cases? – Erlend D. Aug 27 '19 at 08:29
  • Yes, if it's actually using the equality part (which from the look of it, it is). I can see the reason for the difference in speed - Jon's code isn't making a defensive copy of the array, which is fine as long as you never change the contents of the array. So for a proper speed comparison, you should change my class to avoid making a copy of the array. I'll update my answer to include that version, for completeness. – Matthew Watson Aug 27 '19 at 08:38
  • I added the "QuickIntArrayKey" to the table. It's faster, but still a tad slower than the IEqualityComparer. My guess is that it's because the IntArrayKey has to instantiate a new class for every key, while the IEqualityComparer simply overrides the actual comparison. – Erlend D. Aug 27 '19 at 09:07
  • Yes, that sounds plausible. You might get a speedup by modifying Jon's code so that it is just for `int` rather than making it generic, then you can avoid the overhead of calling `elementComparer.GetHashCode()` (you can just use the value of the int itself). – Matthew Watson Aug 27 '19 at 09:24
  • Right you are, that gave another speedup (added to the table). Thanks! – Erlend D. Aug 27 '19 at 09:56