0

I have an object with 9 properties: integer, string, decimal, guid.

I have an array of this object, 2 million records in length.

I'd like to create a string hash/checksum of the properties of this object that won't collide with any other records.

What's the best way to do this in C#? I thought about concat and md5 but concat may produce collisions if I have a=1 b=12 or a=11 b=2 they'd both concat to 112.

Edit: Maybe CHECKSUM is a better word? I just need to do fast comparisons but not direct object comparisons, I have to do value comparisons of every field.

NibblyPig
  • 51,118
  • 72
  • 200
  • 356
  • Have you considered using a unique key? – rory.ap Feb 02 '15 at 13:08
  • Do each of your objects have a unique set of properties? – paul Feb 02 '15 at 13:10
  • Guid already is globally unique, I guess? Unless the same Guid is reused in multiple objects. – kennyzx Feb 02 '15 at 13:10
  • Since hash value is `Int32` (if you're looking for GetHashCode implementation) you have slightly more than `4e9` possibilties, and there're far more possibilities for your classes to collide; that's why according to *Dirichlet princpile* there's no such hash – Dmitry Bychenko Feb 02 '15 at 13:12
  • I need to use it for comparison, the resulting 'hash' can be any length within reason as long as I can compare two objects. – NibblyPig Feb 02 '15 at 13:14
  • concerning collision: http://stackoverflow.com/questions/297960/hash-collision-what-are-the-chances – Theolodis Feb 02 '15 at 13:14
  • Guid is just a property pointing to something else (foreign key essentially), multiple records may have the same guid (just included it as it's a 'type') – NibblyPig Feb 02 '15 at 13:15
  • @SLC, what you want is overriding `Equals`, you don't need hashing if you just need to compare objects. – kennyzx Feb 02 '15 at 13:16
  • Include array index in your hash. – Amit Feb 02 '15 at 13:19
  • _"concat may produce collisions if I have a=1 b=12 or a=11 b=2"_ - and there's nothing you could think of to prevent that? – CodeCaster Feb 02 '15 at 13:24
  • 3
    Hashes **always** collide, even the strongest cryptographic ones. They guarantee that objects with different hashes can never be equal, not the opposite. They aren't keys, they are a way to avoid expensive value comparisons. If you want to override GetHash, just XOR the hashes of the object's fields – Panagiotis Kanavos Feb 02 '15 at 13:24
  • I don't want to compare objects within c# I just need a string that can be used for comparison. You can do this in SQL to some degree with BINARY_CHECKSUM but I can't find a c# equivalent. I will investigate this XOR thing it sounds like it might be what I want? The least efficient way of doing what i want is to concat everything with a delimiter, but I am searching for something that will result in a shorter output that I can do faster comparisons. Maybe hashing isn't the right word? – NibblyPig Feb 02 '15 at 13:34
  • Google also tells me with md5 I'd have to hash 6 billion files per second for 100 years to produce a collision on average. So I'm unsure about why the non-uniqueness is being emphasised so much... my dataset would never exceed a few million records total. – NibblyPig Feb 02 '15 at 13:59
  • IMHO, If strictly accurate matching being needed, I would never rely on any algorithm producing collisions (not even with a low probability as MD5). The problem here is that I would never know which values would produce such collisions, and therefore I would never know whether those values are inside the collection or not. – Rubidium 37 Feb 02 '15 at 14:27

1 Answers1

2

I have to do value comparisons of every field

If that is your final need, no computed value can avoid to compare single fields, unless that computed value is unique for any combination of field values, like (as an example) a string obtained concatenating the values of all the field of each object, but remember that converting to string some values can cause approximation and thus lead to wrong mismatch between objects (especially with floating point numbers).

Field by field comparison is the most accurate one can need, whereas hash/checksum computation is not meant for accurate comparison, but only for indexing, or as preliminary check to avoid more intensive computations (like yours), or other goals where field by field is not required.

You can eventually write a readonly property that compute the value once, only when needed, and store it as an hidden field, like:

    public class _Object
    {
        public Int32 IntField;
        public String StringField;
        public Decimal DecimalField;
        public Guid GuidField;

        private string m_UniqueKey;
        [DebuggerBrowsable(DebuggerBrowsableState.Never)]
        public string UniqueKey
        {
            get
            {
                if (m_UniqueKey == null)
                {
                    m_UniqueKey = IntField.ToString()
                                + "|" + (StringField ?? string.Empty)
                                + "|" + DecimalField.ToString("F6", CultureInfo.InvariantCulture)
                                + "|" + GuidField.ToString("X");
                }
                return m_UniqueKey;
            }
        }
    }

The code sample above computes m_UniqueKey only once (if it is null) and uses an arbitrary character as a separator between field values. It also try to format the decimal value to an arbtrary chosen precision.

In the case you need an hash/checksum value, you can try to implement GetHashCode() and rely on it, but also in this case you should include all important fields, or part of them.

Regards,
Daniele.

Rubidium 37
  • 661
  • 6
  • 12