2

I need a HashSet for byte arrays in order to check if a given byte array exists in the collection. But it seems like this doesn't work for byte arrays (or perhaps any array).

Here is my test code:

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>();
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString();//returns 2 instead of the expected 1.
}

Is there a way to make a HashSet for byte arrays?

NiKiZe
  • 1,256
  • 10
  • 26
ispiro
  • 26,556
  • 38
  • 136
  • 291
  • You're adding two different arrays to the `HashSet`, although they have the same content. You need to use an `EqualityComparer` that compares content instead. – Alejandro Apr 09 '18 at 18:56
  • `HashSet` calls `T.Equals()` to determine if elements are equal, and `Array.Equals()` returns if the references are equal, so `b1.Equals(b1)` is true but `b1.Equals(b2)` is false. You should create a class and implement `Equals`. – Dour High Arch Apr 09 '18 at 18:59
  • 1
    Good hash function depends on what arrays you expect. Suppose you expect small number of arrays each of which is very long. Then hash functions usually proposed, which perform some operation on _all_ bytes in array are pretty bad, and you would better use simple hash function, even trivial one like array.Length might be better in such case. – Evk Apr 09 '18 at 19:08
  • 1
    @Servy: I wouldn't want to use the linked answer for this. While it would *work*, byte arrays have wildly differing distributions from other kinds of lists. – Joshua Apr 09 '18 at 19:12
  • 1
    And if so happens that you store random arrays there (by random I mean that distribution is close to uniform) - then take first 4 bytes, convert to int (with BitConverter) and use that as hash code, because for random arrays chance of collision for that is negligible and you don't waste cpu looping through whole array 2 times. – Evk Apr 09 '18 at 19:50
  • @Evk Thanks. That's a good point. – ispiro Apr 09 '18 at 19:51

1 Answers1

4

Construct a HashSet with an IEqualityComparer<byte[]>. You don't want to use an interface here. While byte[] does in fact implement interfaces such as IEnumerable<byte>, IList<byte>, etc., use of them is a bad idea due to the weightiness involved. You don't use the fact that string implements IEnumerable<char> much at all so don't for byte[] either.

public class bytearraycomparer : IEqualityComparer<byte[]> {
    public bool Equals(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }
    public int GetHashCode(byte[] a)
    {
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>(new bytearraycomparer );
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString();
}

https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx

If you were to use the answers in proposed duplicate question, you would end up with one function call and one array bounds check per byte processed. You don't want that. If expressed in the simplest way like so, the jitter will inline the fetches, and then notice that the bounds checks cannot fail (arrays can't be resized) and omit them. Only one function call for the entire array. Yay.

Lists tend to have only a few elements as compared to a byte array so often the dirt-simple hash function such as foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); if you use that kind of hash function for byte arrays you will have problems. The multiply by a small odd number trick ends up being rather biased too quickly for comfort here. My particular hash function given here is probably not optimal but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd was getting into trouble too quickly due to possessing numerous collisions that were only two bytes long/different. If you avoid the degenerate numbers this family will have no collisions in two bytes and most of them have no collisions in three bytes.

Considering actual use cases: By far the two most likely things here are byte strings and actual files being checked for sameness. In either case, taking a hash code of the first few bytes is most likely a bad idea. String's hash code uses the whole string, so byte strings should do the same, and most files being duplicated don't have a unique prefix in the first few bytes. For N entries, if you have hash collisions for the square root on N, you might as well have walked the entire array when generating the hash code, neglecting the fact that compares are slower than hashes.

Joshua
  • 40,822
  • 8
  • 72
  • 132
  • 1
    Thanks. +1. How does one choose a `GetHashCode` implementation? This is quite important here in a `HashSet`. – ispiro Apr 09 '18 at 19:03
  • @ispiro: I keyed in an ok hash code that's similar to ELFHash. If you want to ask about a good one ask another question. – Joshua Apr 09 '18 at 19:05
  • @Joshua : I am quite fascinated with your hashing calculation. is this standard one to follow or came up with your own version? Anyways, +1 – Akash KC Apr 09 '18 at 19:31
  • @ispiro : Thanks.I didn't find Joshua's versioned hashing implementation in posted link – Akash KC Apr 09 '18 at 19:47
  • @AkashKC oops. Sorry. I just quickly saw the `23` and didn't look further. – ispiro Apr 09 '18 at 19:50
  • 1
    @AkashKC: The idea is to smoothly mix the bits without saturating (the conventional wisdom of multiply by an odd number is biased toward 1 bits). Any number between 8 and 24 that is relatively prime to 32 would work. – Joshua Apr 09 '18 at 20:20
  • @AkashKC: I found and fixed a subtle bug in the hash. It turns out this hash is unexpectedly good once the bug is fixed. – Joshua Apr 12 '18 at 20:53
  • @Joshua : Thanks for extending your answer with having more detail on hashing. Appreciated. Still silly question from me , what was initial idea behind `b = ((b << 23) | (b >> 9)) ^ a[i];` ? I didn't find any reference with that hashing implementation so I presumed that you wrote it with your experience. – Akash KC Apr 13 '18 at 04:17
  • @AkashKC: Yeah pretty much my experience. – Joshua Apr 13 '18 at 15:16
  • @Joshua you have a typo in the `GetHashCode`. It should read `Length` instead of `length`. – marko.ristin Jul 08 '22 at 12:13