Construct a HashSet<byte[]> with an IEqualityComparer<byte[]>. You don't want to use an interface here. While byte[] does in fact implement interfaces such as IEnumerable<byte> and IList<byte>, using them is a bad idea because of the overhead involved. You don't use the fact that string implements IEnumerable<char> much at all, so don't for byte[] either.
using System.Collections.Generic;

public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Rotate left 23 bits, then XOR in the next byte;
        // every byte of the array contributes to the hash.
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}
void Test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };
    HashSet<byte[]> set = new HashSet<byte[]>(new ByteArrayComparer());
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // displays "1": the two arrays compare equal
}
https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx
If you were to use the answers in the proposed duplicate question, you would end up with one function call and one array bounds check per byte processed. You don't want that. Expressed in the simplest way, as above, the jitter will inline the fetches, then notice that the bounds checks cannot fail (arrays can't be resized) and omit them. Only one function call for the entire array. Yay.
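For contrast, here is a rough sketch of the interface-based route those answers lead to (the class name is made up; SequenceEqual is real LINQ, and on older runtimes it walks both arrays through an enumerator, paying exactly the per-byte call overhead described above):

using System.Collections.Generic;
using System.Linq;

// Sketch only: per-byte enumerator calls instead of one inlined loop.
public class SequenceEqualComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;
        return a.SequenceEqual(b); // one MoveNext/Current pair per byte on old runtimes
    }

    public int GetHashCode(byte[] a)
    {
        return a.Length; // placeholder; a real hash is discussed below
    }
}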
Lists tend to have only a few elements compared to a byte array, so for lists a dirt-simple hash function such as foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); often suffices. If you use that kind of hash function for byte arrays, you will have problems: the multiply-by-a-small-odd-number trick becomes too biased, too quickly, for comfort here. The particular hash function given above is probably not optimal, but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd hash was getting into trouble too quickly because it produced numerous collisions between inputs that were only two bytes long or differed in only two bytes. If you avoid the degenerate rotation amounts, this family has no collisions on two-byte inputs, and most of its members have none on three-byte inputs either, as the demonstration below shows.
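To make the two-byte collision concrete, here is a quick demonstration (the arrays and helper names are just for illustration): with hash = hash * 5 + b, the arrays { 0, 5 } and { 1, 0 } both hash to 5, while the rotate-xor hash above keeps them apart.

using System;

// Illustration only: the multiply-by-5 hash collides on two-byte
// inputs; the rotate-xor family does not.
static int MultiplyHash(byte[] a)
{
    int h = 0;
    foreach (byte x in a) h = h * 5 + x;
    return h;
}

static int RotateXorHash(byte[] a)
{
    uint b = 0;
    foreach (byte x in a) b = ((b << 23) | (b >> 9)) ^ x;
    return unchecked((int)b);
}

static void CollisionDemo()
{
    byte[] p = { 0, 5 }, q = { 1, 0 };
    Console.WriteLine(MultiplyHash(p) == MultiplyHash(q));   // True: both are 5
    Console.WriteLine(RotateXorHash(p) == RotateXorHash(q)); // False: 5 vs 8388608
}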
Considering actual use cases: by far the two most likely inputs here are byte strings and actual files being checked for sameness. In either case, taking a hash code of just the first few bytes is most likely a bad idea. String's hash code uses the whole string, so byte strings should do the same, and most files being checked for duplicates don't have a unique prefix in the first few bytes. For N entries, once you get hash collisions at around the square root of N, you might as well have walked the entire array when generating the hash code, and that's before accounting for compares being slower than hashes.
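As a sketch of why prefix hashing goes wrong for files (the eight-byte cutoff is arbitrary, but the signature fact is real: every PNG file starts with the same eight bytes):

// Sketch only: hashing a fixed-size prefix. All PNG files share the
// same first 8 signature bytes, so this hash maps every PNG to the
// same bucket and lookups degenerate into full-array compares.
static int PrefixHash(byte[] a)
{
    uint b = 0;
    int n = Math.Min(a.Length, 8);
    for (int i = 0; i < n; i++)
        b = ((b << 23) | (b >> 9)) ^ a[i];
    return unchecked((int)b);
}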