0

I am working with existing data and have records which contain an array double[23] and double[46]. The values in the array can be the same across multiple records. I would like to generate an id (perhaps an int) to uniquely identify the values in each array.

There are places in the application where I need to group records based on the values in the array being identical. While there are ways to query for this, I was hoping for a single int field (or something similar) to group on. This would really help simplify queries and especially help with report tools where grouping on a smaller single field would help immensely.

I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values. I had tried implementing

((IStructuralEquatable)combined).GetHashCode(EqualityComparer<double>.Default);

To compare the structure and data, but again, I don't think this is guaranteed to match another double[] having the same values.

Perhaps a form of checksum would work but admittedly I am having trouble implementing something. I am looking for suggestions/direction.

Here is data for 3 sample records. Data in record 1&3 are the same so a generated id should match for those. 32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9 40.8,49.0,50.0,49.0,47.8,47.0,45.8,44.5,43.5,42.0,40.5,38.5,36.8,34.5,32.5,30.5,28.0,25.5,22.5,19.5,16.5,13.5,10.5,48.0 32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9

Perhaps this is not possible without just checking all the data, but was hoping for a better solution to simplify the application and improve the speed.

The goal is to add a new id field to the existing records to represent the array data. That way, passing records into report tools would group together easily on one field rather than checking the whole array on each record.

I appreciate any direction.

EDIT - Some issues I ran into trying things (incase it helps someone)

In trying to understand this originally, I was calling this code (which is part of .NET). I understood these functions would hash the values of the array together (only 8 values in this case). I didn't think it included the array handle. The result was not quite as expected as there is a bug MS corrected in .NET as per the commented line below. With the fix I was getting better results.

int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
        if (comparer == null)
            throw new ArgumentNullException("comparer");
        Contract.EndContractBlock();

        int ret = 0;

        for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
            ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i))); 
//.NET 4.6.2, in .NET 4.5.2 it is ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0))) 
        }

        return ret;
    }

    internal static int CombineHashCodes(int h1, int h2) {
        return (((h1 << 5) + h1) ^ h2);
    }

I modified this to handle more than 8 values and still had some hashes not matching. I later determined the issue was in the data; I was unaware some of the records had some doubles stored with more than one decimal place (should have been rounded). This of course changed the hash. Now that I have the data consistent, I am seeing matching hashes; any arrays with identical values have an identical hash.

Kerfluffel
  • 117
  • 9
  • A hash code is probably the closest you can get with a single value, with the chance of collisions. For 100% certainity, you need to loop and compare each value. – Alejandro Jun 02 '21 at 14:56
  • Might want to read more about the C# `GetHashCode`, it's not really a cryptographic hash, more of a convenience to check if things are different. See https://stackoverflow.com/a/7425150/1462295 and note `Two unequal objects are not guaranteed to have unequal hashcodes` which is more of a concern with a single 32 bit hash value. – BurnsBA Jun 02 '21 at 15:28
  • @Alejandro I was hoping there would be another way but as I understand it gethashcode always has a chance of collisions so some array ids may not match, even though the data is identical. I thought the IStructuralEquatable code might compare the array structure and data, but still seems to be a chance of collisions and therefore id may not uniquely represent data all the time. Thanks for the input. – Kerfluffel Jun 02 '21 at 15:28
  • @BurnsBA Thanks BurnsBA. I had learned some of this as I go and understand there can be collisions. I was hoping for another algorithm to identify the arrays but not looking too promising. I am not tied to an int, was just looking for something better than comparing the whole array for each record but not sure I'll get away from it. Perhaps a cryptographic hash may not have collisions but I have to read up on it as I have no experience with it as of yet. – Kerfluffel Jun 02 '21 at 15:45
  • 1
    Compare the number of possible combinations of values in a double[23] withe the number of possible values of an int and you will see that this cannot work. – Klaus Gütter Jun 02 '21 at 15:53
  • @KlausGütter Ah, yes, I see your point. I wasn't tied to an int in particular, was just looking for some reliable solution but hadn't thought of this. Thanks – Kerfluffel Jun 02 '21 at 15:58

1 Answers1

1

I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values

Quite the opposite, a hash function is required by design to return equal hashes for equal inputs. For example, 0 is a good starting point for your hash function, returning the value 0 for equal rows. Everything else is just an optimization to try to reduce false positives.

Perhaps this is not possible without just checking all the data

Of course you need to check all the data, how else would you do it?

However your implementation is broken. The default hash function for an array hashes the handle to the array itself, so different instances of arrays with the same data will show up as different. What you want to do is to use a HashCode instance and Add() each element of your array in it to get a proper hash code.

Blindy
  • 65,249
  • 10
  • 91
  • 131
  • Thanks for the input. I was not looking to run these through a hash to compare them on the fly, I was looking to generate a new field to represent the data array uniquely to be used at a later time (reports especially). I had tried a few things but have seen identical hash codes come up for different array data. Admittedly, I don't have a lot of experience with hashing or checksums so I apologize if I may be misinterpreting. Perhaps you are suggesting a cryptographic hash function. – Kerfluffel Jun 02 '21 at 15:38
  • Now you're describing the opposite than in your original question. I'll say it again, a proper hash function has the property that the result is equal for equal inputs. It does not return an unique value (unless you're in a degenerate case), and it does not guarantee different values for different inputs (though if you are good enough at math, or use a proper implementation like `HashCode`, will *generally* be true). – Blindy Jun 02 '21 at 16:10
  • Sorry for the poor wording/descriptions. For "proper hash function has the property that the result is equal for equal inputs" - this wasn't working for me; I was actually getting different values with the same input. However, I found a documented bug in .NET 4.5.2 where this isn't working correctly. It is fixed in 4.6.2 for sure, possibly earlier. I implemented a quick version of the fix for the whole array and these seem to have the same result now. I'll keep working on it - sorry for the confusion. – Kerfluffel Jun 02 '21 at 17:07
  • "I was actually getting different values with the same input" -- I explained that, you're using the wrong hash function. The result is correct, you're just interpreting it wrong. And I don't know what bug you're referring to, hash functions work the same everywhere, they're a mathematical concept. – Blindy Jun 02 '21 at 17:13
  • FYI I added an edit above as it was too long for a comment. It explains some issues I ran across trying to get matching hashes based on hashing the array values together as you had suggested. Thanks for your guidance. – Kerfluffel Jun 02 '21 at 20:33