
I have an input of 288 bits (comprising 4 × 32-bit identity function outputs and 10 × 16-bit integers). I need to hash this to 96 bits with as few collisions as possible. The goal could be stated as key compression with probabilistic collisions.

I'm aware that a CRC is a bijective hash when the input is no longer than the CRC width, thus ensuring a 100% even distribution (as I understand it). In my view, I should be able to run 3 parallel CRC paths through the input, resulting in a 96-bit lossy hash (obviously not bijective) of optimal distribution.

However, I'm also aware that CRC is not used for such applications. An algorithm such as MetroHash would typically be used.

Could someone explain to me why CRC is a bad (or not) idea for this application?

Note: This is not intended for anything secure.

  • CRC can be implemented efficiently in shift registers plus a few gates - software implementations are done for compatibility or nostalgia. I'd try a [Fletcher sum](https://en.wikipedia.org/wiki/Fletcher%27s_checksum), even if this is just nine down to five terms; the data size being fixed, you can play with "the factors" (see the sketch below). Note that there are checksums which are problematic for small sizes, Adler32 being one. (Another tag to consider is [tag:checksum].) – greybeard Dec 03 '17 at 11:26
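
A minimal sketch of the Fletcher-style sum suggested above, assuming the 288-bit input is processed as 18 16-bit words with the standard Fletcher-32 modulus of 65535; widening this toward 96 bits would mean larger accumulators or different moduli (the "factors" to play with):

```python
def fletcher32(words):
    """Fletcher-32: two running sums over 16-bit words, modulus 65535."""
    s1, s2 = 0, 0
    for w in words:
        s1 = (s1 + w) % 65535
        s2 = (s2 + s1) % 65535
    return (s2 << 16) | s1

# The 288-bit input viewed as 18 16-bit words (placeholder values):
words = [0x1234] * 18
print(hex(fletcher32(words)))
```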

1 Answer


Sure, that can work, but there are probably better approaches.

For it to work, you would need to use three different CRC-32s with three different polynomials. And even then, be careful that the polynomials don't have common factors (e.g. x+1), to make sure that there are no correlated bits between the three.
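
As a sketch of that three-polynomial idea (the generator polynomials below are common, distinct CRC-32 polynomials, but whether any pair shares a factor would still need to be checked), using a deliberately simple bit-at-a-time CRC:

```python
def crc32_with_poly(data: bytes, poly: int) -> int:
    """Bit-at-a-time MSB-first CRC-32 (zero init, no reflection or xor-out).
    Not compatible with the standard CRC-32 checksum, but fine for hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ poly) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc

# Three distinct, well-known generator polynomials (normal representation):
POLYS = (
    0x04C11DB7,  # CRC-32 (IEEE 802.3)
    0x1EDC6F41,  # CRC-32C (Castagnoli)
    0x741B8CD7,  # CRC-32K (Koopman)
)

def hash96(data: bytes) -> int:
    """Concatenate three CRC-32s of the *whole* input into 96 bits."""
    h = 0
    for poly in POLYS:
        h = (h << 32) | crc32_with_poly(data, poly)
    return h

print(hex(hash96(bytes(range(36)))))  # 36 bytes = 288 bits of sample input
```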

Better would be an approach like the one used in xxhash, but extended to 96 bits. That would be faster in software.
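
One possible sketch along those lines, assuming the third-party Python `xxhash` bindings (whose 128-bit XXH3 variant postdates this exchange), is simply to truncate a 128-bit digest to 96 bits:

```python
import xxhash  # third-party: pip install xxhash

def hash96_xxh(data: bytes) -> int:
    """96-bit hash by truncating the 128-bit XXH3 digest."""
    return xxhash.xxh3_128(data).intdigest() >> 32  # keep the top 96 bits

print(hex(hash96_xxh(bytes(range(36)))))
```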

Why 96 bits? That seems like an unnecessarily long hash.

  • Each CRC sees every third 32-bit word, i.e. 0 1 2 0 1 2 0 1 2. Why would this require different polynomials? The input data is not shared among the 3. – IamIC Dec 03 '17 at 18:23
  • A 64-bit hash of a mere 24-bit input has a 1 in ~131,072.5 collision probability. At 96 bits, it's 1 in > 562 trillion. – IamIC Dec 03 '17 at 18:26
  • @IamIC `Each [CRC32 used for 96 check bits total] is seeing every 1-in-3 32-bit words` - I was expecting each CRC to see all of the input, and I claim to be able to find a reason to prefer that, as well as to muse about a direct CRC-96 - but this is drifting in the direction of a chat or an answer. – greybeard Dec 03 '17 at 23:27
  • The question said nothing about "each CRC is seeing 1-in-3 32-bit words". A good hash will thoroughly mix the input information into all of the bits of the hash. Not one-third of the information to one-third of the hash three times. So to do this right, you'd need three different CRCs, each operating on all of the input. – Mark Adler Dec 04 '17 at 00:38
  • Or some other 96-bit hash. (It could be a 96-bit CRC, but I am not aware of research into good 96-bit polynomials. The work to find good polynomials is unfortunately exponential in their length.) – Mark Adler Dec 04 '17 at 01:09
  • I have no idea what you mean by "A 64-bit hash of a mere 24-bit input". What 24-bit input? If the input is only 24 bits, then a hash longer than that makes no sense, since it can only take on 2^24 possible values. – Mark Adler Dec 04 '17 at 01:09
  • See [this answer](https://stackoverflow.com/a/14217471/1180620) for how to calculate the probability of a collision (a worked version follows after these comments). – Mark Adler Dec 04 '17 at 02:42
  • Thanks, @MarkAdler. In my view, dividing the data into 3 and using a 32-bit CRC on each partition effectively triples the output resolution to 96 bits. I know that traditionally the input is completely mixed; the result would be only 32 bits in this case. I have the formula for hash collision probability; by "24-bit input" I meant 2^24 unique inputs (not the input length). – IamIC Dec 04 '17 at 07:26
  • @IamIC: I think that using 3 splits, CRC'ing them, and concatenating the CRCs is worse than using a 96-bit hash in the first place. – geza Dec 04 '17 at 21:03
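
For reference, a quick sketch of the birthday approximation behind the figures quoted in the comments (per the linked answer, P(collision) ≈ n(n - 1) / 2^(b+1) for n random inputs and a b-bit hash):

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation: chance that any two of n uniformly
    random hash values of the given bit width collide."""
    return n * (n - 1) / 2 ** (bits + 1)

n = 2 ** 24  # 2^24 unique inputs, as clarified in the comments
print(1 / collision_probability(n, 64))  # ~1.3e5  -> 1 in ~131,072
print(1 / collision_probability(n, 96))  # ~5.6e14 -> 1 in ~563 trillion
```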