8

What is the best hash method for an array of byte?

The arrays are serialized class objects containing jpeg image passed between applications over TCP/IP.

The array size is about 200k.

casperOne
  • 73,706
  • 19
  • 184
  • 253
Chesnokov Yuriy
  • 1,760
  • 5
  • 21
  • 35

4 Answers4

11

Any of the built-in hashing functions should do; depending on how much you care about collisions these are your options (from most collisions to least):

  • MD5
  • SHA1
  • SHA256
  • SHA384
  • SHA512

They are as simple to use as:

var hash = SHA1.Create().ComputeHash(data);

Bonus Marks: If you don't care about security (which I don't think you do given that you are getting the hashes for images) you might want to look into Murmur hash, which is designed for content hashing and not secure hashing (and is thus much faster). It isn't, however, in the framework so you will have to find an implementation (and you should probably go for Murmur3).

Edit: If you are looking for a HASHCODE for a byte[] array it's entirely up to you, it usually consists of bit shifting (by primes) and XORing. E.g.

public class ByteArrayEqualityComparer : IEqualityComparer<byte[]>
{
    public static readonly ByteArrayEqualityComparer Default = new ByteArrayEqualityComparer();
    private ByteArrayEqualityComparer() { }

    public bool Equals(byte[] x, byte[] y)
    {
        if (x == null && y == null)
            return true;
        if (x == null || y == null)
            return false;
        if (x.Length != y.Length)
            return false;
        for (var i = 0; i < x.Length; i++)
            if (x[i] != y[i])
                return false;
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        if (obj == null || obj.Length == 0)
            return 0;
        var hashCode = 0;
        for (var i = 0; i < obj.Length; i++)
            // Rotate by 3 bits and XOR the new value.
            hashCode = (hashCode << 3) | (hashCode >> (29)) ^ obj[i];
        return hashCode;
    }
}
// ...
var hc = ByteArrayEqualityComparer.Default.GetHashCode(data);

EDIT: If you want to validate that the value hasn't changed you should use CRC32.

Jonathan Dickinson
  • 9,050
  • 1
  • 37
  • 60
  • thank you for the answer, I need quick `byte[]` array content comparison only, there is no need for encripted hashes. I need to make sure sent data remains the same as it is recieved on the other end – Chesnokov Yuriy Aug 30 '11 at 13:55
  • @Chesnokov so why didn't you ask for that in the first place? – Jonathan Dickinson Aug 30 '11 at 13:59
  • I meant comparison by hash value, as in the question, data is sent over internet along with a hash. On the other end hash is recomputed and compared to make sure there were no modifications on the data during transfer – Chesnokov Yuriy Aug 30 '11 at 14:02
  • @Chesnokov - I added a link for CRC32, which is what you should be using. – Jonathan Dickinson Aug 30 '11 at 14:08
5

Jon Skeet has a good answer on how to override GetHashCode, which is based on general effective hash techniques where you start with a prime number, add it to the hash codes of the components multiplied by another prime number, allowing for overflow.

For your case, you would do:

static int GetByteArrayHashCode(byte[] array)
{
    unchecked
    {
        int hash = 17;

        // Cycle through each element in the array.
        foreach (var value in array)
        {
            // Update the hash.
            hash = hash * 23 + value.GetHashCode();            
        }

        return hash;
    }
}

Note in Jon's answer he goes into why this is better than XORing the hashes of the individual elements (and that anonymous types in C# currently do not XOR the hashes of the individual elements, but use something similar to the above).

While this will be faster than the hash algorithms from the System.Security.Cryptography namespace (because you are dealing with smaller hashes), the downside is that you might have more collisions.

You would have to test against your data and determine how often you are getting collisions vs. the work that needs to be done in the case of a collision.

Community
  • 1
  • 1
casperOne
  • 73,706
  • 19
  • 184
  • 253
  • Is `foreach` slower than `for`? Also, no need to call `GetHashCode` on `byte` as it just returns its value cast to `int`. – Drew Noakes Nov 15 '16 at 12:30
  • @DrewNoakes Pretty sure that the compiler changes `foreach` on arrays to `for`. That's an implementation detail though, and generally you should test if you see this is a bottleneck. Also, same with the return value of `GetHashCode` for byte. – casperOne Nov 15 '16 at 14:45
4

Based on the Compiler Generated GetHashCode()

public static int GetHashCode(byte[] array) {
    unchecked {
        int i = 0;
        int hash = 17;
        int rounded = array.Length & ~3;

        hash = 31 * hash + array.Length;

        for (; i < rounded; i += 4) {
            hash = 31 * hash + BitConverter.ToInt32(array, i);
        }

        if (i < array.Length) {
            int val = array[i];
            i++;

            if (i < array.Length) {
                val |= array[i] << 8;
                i++;

                if (i < array.Length) {
                    val |= array[i] << 16;
                }
            }

            hash = 31 * hash + val;
        }

        return hash;
    }
}

Ah... and a link to C# Murmurhash http://landman-code.blogspot.com/2009/02/c-superfasthash-and-murmurhash2.html

Community
  • 1
  • 1
xanatos
  • 109,618
  • 12
  • 197
  • 280
2

Any of the crypto hashing stuff should work. Not sure about the speed. Perhaps MD5?

leppie
  • 115,091
  • 17
  • 196
  • 297
  • are there any custom methods in .NET for byte[] array quick comparison only, I do not need encryption yet – Chesnokov Yuriy Aug 30 '11 at 13:50
  • @Chesnokov that sounds like a different question; like: http://stackoverflow.com/questions/43289/comparing-two-byte-arrays-in-net – Jonathan Dickinson Aug 30 '11 at 13:54
  • oh, no. I need a fast method to get 32 bit value for `byte[]` array. The serialized object is sent along with its hash to other machine where hash is recomputed and compared – Chesnokov Yuriy Aug 30 '11 at 13:58