2

I'm working with a code that is calculating hashes of lists of objects, algorithm was taken from this question: Quick and Simple Hash Code Combinations. Based on second answer values for seed and factor are 1009 and 9176. It works ok for calculating hashes of random lists of integers, but I've found that it simply does not work when lists are similar.

If we create a list of 20 random integers and calculate the hash using:

int[] hashCodes = {
    -1641555406,
    1406166370,
    431811193,
    -719284004,
    -463280747,
    138136561,
    -1634028130,
    -792182888,
    1325264708,
    2143865166,
    25622596,
    -977152280,
    1955313253,
    -1440973864,
    1627089736,
    1733757615,
    -576076691,
    -145918914,
    1015082677,
    -954685337,
    -1307289157
};
int hashCode = 1009;
foreach (var c in hashCodes)
    hashCode = hashCode * 9176 + c;

And than change only the first number:

hashCodes[0] = -145574454;
hashCode = 1009;
foreach (var c in hashCodes)
    hashCode = hashCode * 9176 + c;

we will end up with the same hash code. Result is the same for any random list of integers - if only the first number is different, we end up with the same hash code around 8-10 iteration.

I believe it's due to integer overflow and truncating highest bits, but I'm not sure. I tried using seed and factor based on the first answer (17 and 31 respectively) and it worked fine. Why is that?

How should such hash (hash of list of integers) be calculated?

Edit: As per comment, this is not cryptographically secure hash and it's not used as such, it's just a way to assign unique integer key to lists of integers.

mateuszk
  • 35
  • 4
  • Maybe add a module of a large and prime number? – Yonlif Mar 23 '20 at 10:58
  • Also please add a programming language tag. And lastly I would note that this is not a cryptographically safe hash. – Yonlif Mar 23 '20 at 11:01
  • 2
    The reason is that you're throwing out the lower bits on each round of the loop. These lower bits were obtained from the early entries in the list, which means that you will in essence only keep the effect from the last N numbers. I don't know what the N is, but the way you're combining values you will make early entries ineffective. Instead you should make sure you don't throw out the lower bits. A better (but still not good, you should consider using a better hash algorithm) would be `hashCode = ((hashCode << 27) | (hashCode >> 5)) ^ c;` – Lasse V. Karlsen Mar 23 '20 at 12:26
  • "Unique integer key" and "hash code" don't go together well. The number of possible permutations of 20 32-bit integers is gigantic--much larger than will fit into a 32-bit integer. It's simply not possible to generate a unique 32-bit hash code for every list. – Jim Mischel Mar 23 '20 at 20:02

1 Answers1

2

The reason is that your multiplication part is moving the bits out to the left, and if you have enough loop iterations the bits obtained from the first numbers in the list will eventually be thrown out completely and no longer have an effect on the final result.

The number 9176 can be written in binary as 10001111011000, and in practice the lowest 1-bit will dictate how many rounds you need to run before the first entry completely falls off the list.

The last 1-bit, is at position 3 (or the 4th position from the right), and this means you're moving the bits from the first number 4 positions to the left on every iteration. By the time you've done this 8 times, you've moved that number completely out of the 32-bit buffer (int is 32-bit).

A better method (but see my comment below) would be to at least ensure no bits are completely lost, so a different but still fairly simple way of calculating the hash code could be like this:

hashCode = ((hashCode << 27) | (hashCode >> 5)) ^ c;

This basically rotates the current hash code 27 bits to the left, and the 5 bits that fall off are rotated back in from the right, and then an exclusive OR with c bakes that into the number as well.


You should, however, use a more standardized way of calculating these hashes. My suggested change above is bound to have problems of its own, they're just not as obvious.

And really, because of the pigeon hole principle, you cannot calculate a unique number for a list of numbers, and this has nothing to do with which hash code algorithm you're using. None of them will solve this part of the problem. So I would really ask you to rethink what you're doing in the first place.

Lasse V. Karlsen
  • 380,855
  • 102
  • 628
  • 825
  • Thanks for the answer. It's not my code, just trying to figure out why this happens and how this problem should be solved - totally agree that you can't assign unique integer for every possible list of integers. Could you tell me what "more standardized way of calculating these hashes" did you mean? – mateuszk Mar 23 '20 at 13:12
  • CRC16, CRC32, etc. The already sort of agreed upon, standardized, algorithms for calculating hash codes. I agree that the cryptographic secure algorithms are probably overkill for your needs, but the CRC algorithms are probably safer. – Lasse V. Karlsen Mar 23 '20 at 13:22