Grouping unique integer arrays

Question

One module in my app generates a small array of integers. Typically the size is 25 integers. The integers tend to be pretty small, less than 10000. I'll like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.

So, for every new array I need to figure out if it already exits. And if it does what's the index.

A naive approach is to keep all arrays in a list and then just call:

MyList.FindIndex(x=>x.SequenceEqual(Small_Array));

But this becomes very slow if the number of arrays is getting into the thousands.

A less naive approach is to store all arrays in a dictionary where the key is a hash value from the array. If the hash is just another integer (32bit) than I have cannot find a good hashing algorithm which doesn't collides.

Which, I think leaves me to using a hashing algorithm like MD5 that can be converted into a 128bit integer. Is that a good way to tackle my problem?

Why does it matter if the hashing algorithm collides or not? Why not just use the hash as a basis for deciding if you should check the sequence? — ProgrammingLlama, Jun 06 '18 at 15:21
Hashcodes are allowed to collide - it's just a way to filter out obvious not equals. — Dan Rayson, Jun 06 '18 at 15:21
@DanRayson: They're allowed to collide when computed for a key - but the OP was suggesting using the hash *itself* as the key, and that wouldn't work. — Jon Skeet, Jun 06 '18 at 15:24
take a look at https://stackoverflow.com/questions/670063/getting-hash-of-a-list-of-strings-regardless-of-order or https://stackoverflow.com/questions/8094867/good-gethashcode-override-for-list-of-foo-objects-respecting-the-order Depending on whether you care about order — Yair Halberstadt, Jun 06 '18 at 15:25
@YairHalberstadt: I don't think the OP wants an order-agnostic comparison. — Jon Skeet, Jun 06 '18 at 15:27

Jon Skeet · Accepted Answer · 2018-06-06T15:38:18.663

4

Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".

The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great length to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:

public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}

You'd then create the dictionary like this:

var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());

Then you'd have something like:

public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}

Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.

edited Jun 06 '18 at 15:38

answered Jun 06 '18 at 15:24

Jon Skeet

1,421,763
867
9,128
9,194

I'm wondering if there's a mathematical way to know the correct number of elements from the array to use when generating the hashcode. Less means faster but more likely to need to check entire collection, More means slower but less likely to require full checking. Balancing act? – Dan Rayson Jun 06 '18 at 15:28
@DanRayson: It's only going to be once per array, so I suspect it's not much of an issue. The OP can always tweak that if necessary. – Jon Skeet Jun 06 '18 at 15:31
@DanRayson That's depend on the way the elements are generated. One way is to logically identify the portions of the arrays to most likely collide, and try to ignore those portions when generating the hash. An example might be to ignore `M`and `N` in `xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx` for GUIDs (a silly example since GUIDs are already a compact unique identifier. – cwharris Jun 06 '18 at 15:31
When I say ignore, I mean compare them last - compare the items in the order of how likely they are to differ in value. – cwharris Jun 06 '18 at 15:35
@cwharris: That can only be done in `Equals` - not in `GetHashCode`. I've just thought of a minor optimization though... – Jon Skeet Jun 06 '18 at 15:38
@DaisyShipton Yes. I somehow managed to mangle those ideas. The hash generation could use the most likely values to change, and the equals could compare in order of likelihood. – cwharris Jun 06 '18 at 15:40

Grouping unique integer arrays

1 Answers1