
So I have an object that will either be stored in a database table or found in a file that is used for updating that table. We need to do comparisons between the table and the update file to avoid duplicates when we update.

My first attempt to solve the problem was to do a String.Join on the fields, convert that to bytes, and finally MD5-hash that byte array. The problem is that we would sometimes get the same string when some (but not all) of the fields were null.
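For what it's worth, that null-field collision can be reproduced (and avoided) with an explicit separator plus a null sentinel. This is just a minimal sketch with made-up field values, not the actual fields in question:

```csharp
using System;

static class KeyDemo
{
    // The original approach: join fields with no separator, so nulls
    // vanish and adjacent fields can bleed into each other.
    public static string NaiveKey(params string[] fields) =>
        string.Join("", fields);

    // One possible fix: an explicit separator plus a null sentinel keeps
    // ("ab", null) and ("a", "b") distinct before hashing.
    public static string SafeKey(params string[] fields) =>
        string.Join("\u001F", Array.ConvertAll(fields, f => f ?? "\u0000"));

    static void Main()
    {
        Console.WriteLine(NaiveKey("ab", null) == NaiveKey("a", "b")); // True: collision
        Console.WriteLine(SafeKey("ab", null) == SafeKey("a", "b"));   // False: distinct
    }
}
```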

So the second way we decided to do it was to serialize the object into bytes and MD5-hash the string of those bytes. So far that's worked fine, but it was brought to my attention that it could be unstable (if someone updates .NET versions, for example).

Is this something I need to worry about?

Example code for those who want it:

public void GenerateHash()
{
    md5 = ReturnHash();
}

public byte[] ReturnHash()
{
    BinaryFormatter bf = new BinaryFormatter();
    using (MemoryStream ms = new MemoryStream())
    {
        bf.Serialize(ms, this);
        // Note: Encoding.Default is lossy on arbitrary binary data
        string str = System.Text.Encoding.Default.GetString(ms.ToArray());
        return SensitiveNamespace.Hashing.MD5(str).ToBytes();
    }
}
  • You should not use a hash for equality comparison without subsequently comparing the unhashed data. If the hash is different, the objects will be different, but if the hash is the same, there is a chance that the objects are still different. – phoog Aug 20 '14 at 17:26
  • that is a somewhat fair point, but we are ok with the probability of md5 collisions. This post shows the overwhelmingly slight odds of an md5 collision: http://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions – Mr. MonoChrome Aug 20 '14 at 17:28
  • You're probably right about that, but you should also bear in mind that you're not actually talking about random hashes. Rather, they are all coming from similarly-structured data. I don't know whether that would make the odds of a collision higher, lower, or the same, but I would suspect it makes them higher. – phoog Aug 20 '14 at 17:30
  • @Ukemi *BinaryFormatter* stores your assembly's type+version in the serialized data. If you upgrade your code to a new version you'll not get the same binary data. Therefore I would use Xml or Json as serialization format. – L.B Aug 20 '14 at 17:50
  • @L.B that's a good point. Alright, I'll do just that. I'll give you the answer if you make it. – Mr. MonoChrome Aug 20 '14 at 17:56

2 Answers


BinaryFormatter stores your assembly's type+version in the serialized data. If you upgrade your code to a new version, you won't get the same binary data. Therefore I would use XML or JSON as the serialization format.

For example: (Using Json.Net)

// requires: using System.Security.Cryptography; using System.Text;
// and Json.NET's using Newtonsoft.Json;
byte[] GenerateHash(object o)
{
    using (var sha = SHA256.Create())
    {
        var json = JsonConvert.SerializeObject(o);
        return sha.ComputeHash(Encoding.UTF8.GetBytes(json));
    }
}

BTW: You can reduce the chance of collision by using SHA256

  • I used your answer, with success. But I would like to note here that I have many rows, and when I have to do the comparisons, speed matters. MD5 hashing is 4 times faster than SHA256 and is twice as fast as SHA0. I would take the SHA advice if not for the hilariously low chance of an MD5 collision. Still, the advice is useful for others, and if need be I might use it in the future. – Mr. MonoChrome Aug 20 '14 at 19:48
  • I would +1 again for the Json.net reference if I could. – Mr. MonoChrome Aug 20 '14 at 19:49

it was brought to my attention that it could be unstable (if someone updates .NET versions, for example).

Is this something I need to worry about?

What are you comparing the hash against? Are you persisting the hash values of your database data? If not (that is, if you're calculating them at run time), there shouldn't be a problem.

If so, you can put in place some kind of validation job that runs whenever the application starts up, validates the hashes, and changes them if necessary.
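A hedged sketch of such a validation job might look like the following. All names here are illustrative (the question doesn't show its persistence layer); the startup pass just recomputes every row's hash with the current code and rewrites the ones that drifted:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class HashValidator
{
    public static byte[] ComputeHash(string canonicalRow)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(Encoding.UTF8.GetBytes(canonicalRow));
    }

    // Recompute every row's hash with the current code and overwrite any
    // stored hash that no longer matches; returns how many were refreshed.
    public static int Revalidate(IDictionary<int, byte[]> storedHashes,
                                 IDictionary<int, string> rows)
    {
        int updated = 0;
        foreach (var row in rows)
        {
            byte[] fresh = ComputeHash(row.Value);
            byte[] old;
            if (!storedHashes.TryGetValue(row.Key, out old) || !old.SequenceEqual(fresh))
            {
                storedHashes[row.Key] = fresh; // would be a DB update in practice
                updated++;
            }
        }
        return updated;
    }

    static void Main()
    {
        var rows = new Dictionary<int, string> { { 1, "alice|1" }, { 2, "bob|2" } };
        var stored = new Dictionary<int, byte[]>
        {
            { 1, ComputeHash("alice|1") },       // still valid
            { 2, new byte[] { 0xDE, 0xAD } }     // stale, e.g. from an old serializer
        };
        Console.WriteLine(Revalidate(stored, rows)); // 1
        Console.WriteLine(Revalidate(stored, rows)); // 0
    }
}
```

After the first pass, every stored hash matches the current serialization logic, so subsequent runs are no-ops.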

Since the part of the puzzle that you don't control is the serialization code, maybe you should go back to the string-concatenation approach and include some combination of fields that is guaranteed to be unique.

  • I'm not sure why the downvote; this is a fair answer, so +1. That said, it won't work for me, as the number of rows we are talking about makes this unpreferable (not impossible). – Mr. MonoChrome Aug 20 '14 at 17:31
  • @Ukemi In that case, you have to figure out what happens if the serialization code changes and the hashes all change. Maybe the consequence is that your application runs slower for a little while as it needlessly resaves objects that haven't changed. Maybe that's acceptable. – phoog Aug 20 '14 at 17:34