CRC32 hash collision on the same string for any seed

Question

I tried to find a seed that hashes short strings of lowercase letters, of the maximum possible length, without collisions. I chose the SSE 4.2 CRC32 instruction to make the task easier. For lengths 4, 5 and 6 there are no collisions for any seed up to some reasonably small value (I can't wait forever).

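// Build as C++14 or later (for std::cbegin/std::cend) with SSE 4.2 enabled, e.g. -msse4.2 on GCC/Clang.
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <iterator>
#include <iostream>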

#include <x86intrin.h>

static std::bitset<size_t(std::numeric_limits<uint32_t>::max()) + 1> hashes;

static void findSeed()
{
    uint8_t c[7];
    const auto findCollision = [&] (uint32_t seed)
    {
        std::cout << "seed = " << seed << std::endl;
        hashes.reset();
        for (c[0] = 'a'; c[0] <= 'z'; ++c[0]) {
            uint32_t hash0 = _mm_crc32_u8(~seed, c[0]);
            for (c[1] = 'a'; c[1] <= 'z'; ++c[1]) {
                uint32_t hash1 = _mm_crc32_u8(hash0, c[1]);
                for (c[2] = 'a'; c[2] <= 'z'; ++c[2]) {
                    uint32_t hash2 = _mm_crc32_u8(hash1, c[2]);
                    for (c[3] = 'a'; c[3] <= 'z'; ++c[3]) {
                        uint32_t hash3 = _mm_crc32_u8(hash2, c[3]);
                        for (c[4] = 'a'; c[4] <= 'z'; ++c[4]) {
                            uint32_t hash4 = _mm_crc32_u8(hash3, c[4]);
                            for (c[5] = 'a'; c[5] <= 'z'; ++c[5]) {
                                uint32_t hash5 = _mm_crc32_u8(hash4, c[5]);
                                for (c[6] = 'a'; c[6] <= 'z'; ++c[6]) {
                                    uint32_t hash6 = _mm_crc32_u8(hash5, c[6]);
                                    if (hashes[hash6]) {
                                        std::cerr << "collision at ";
                                        std::copy(std::cbegin(c), std::cend(c), std::ostream_iterator<uint8_t>(std::cerr, ""));
                                        std::cerr << " " << hash6 << '\n';
                                        return;
                                    }
                                    hashes.set(hash6);
                                }
                            }
                        }
                    }
                }
            }
            std::cout << "c[0] = " << c[0] << std::endl;
        }
    };
    for (uint32_t seed = 0; seed != std::numeric_limits<uint32_t>::max(); ++seed) {
        findCollision(seed);
    }
    findCollision(std::numeric_limits<uint32_t>::max());
}

int main()
{
    findSeed();
}

It is clear that for strings of length 7 no such seed can exist, because ('z' - 'a' + 1)^7 = 26^7 = 8 031 810 176 > 4 294 967 296 = size_t(std::numeric_limits<uint32_t>::max()) + 1. The notable thing is that for every seed the first collision is between the same two strings, abfcmbk and baabaaa. The colliding value hash6 differs from seed to seed, but the pair of strings does not. I find this curious.

How can it be explained?

Holy nesting Batman. Wow. Why would you ever do something like that? — Jesper Juhl, Sep 26 '20 at 19:31
If your question is about formatting, then the answer is "it is simpler to edit and add another level of nesting for strings of length = length + 1". The code is just for quickly checking a hypothesis. — Tomilov Anatoliy, Sep 26 '20 at 19:35
@TomilovAnatoliy How did you compile this? Is it C++20? All my attempts stop with *`'cbegin' is not a member of 'std'`* and *`'cend' is not a member of 'std'`* — Wolf, Aug 25 '21 at 08:29
See [another answer](https://stackoverflow.com/a/29174491/2932052) for a CRC-32C (alias *iSCSI*) reference implementation. (that's what `_mm_crc32_u8` does) — Wolf, Aug 25 '21 at 08:51
@TomilovAnatoliy Thanks for letting me know. But I'm interested in further exploration of that issue, and so I need something I can build on locally. For this, the [other implementation](https://stackoverflow.com/a/29174491/2932052) seems promising, it helps me to confirm that `abfcmbk` and `baabaaa` give same CRC-32C values for several seeds. — Wolf, Aug 25 '21 at 12:21

score 6 · Accepted Answer · answered Sep 26 '20 at 19:40

If CRC(seed, dat) is the CRC of dat computed with the specified seed, then for any seeds seed1 and seed2 and any pair of equal-length data dat1 and dat2, one can compute CRC(seed2, dat1) as the xor of CRC(seed1, dat1), CRC(seed1, dat2), and CRC(seed2, dat2).

This in turn implies that if two equal-length pieces of data yield the same CRC value for any particular seed, they yield the same value for every possible seed. Suppose CRC(seed1, dat1a) equals CRC(seed1, dat1b) for some seed1. Then for any other seed seed2 and any same-length data dat2, CRC(seed2, dat1a) equals CRC(seed1, dat1a) xor CRC(seed1, dat2) xor CRC(seed2, dat2), and CRC(seed2, dat1b) equals CRC(seed1, dat1b) xor CRC(seed1, dat2) xor CRC(seed2, dat2). The first terms are equal by assumption and the other two terms are identical, so the two results are equal as well.
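
The identity above can be checked numerically. The following is a minimal sketch, not code from the question: crc32c_u8 is a hypothetical bit-by-bit software stand-in for _mm_crc32_u8 (CRC-32C, reflected polynomial 0x82F63B78, no inversion), so it runs without SSE 4.2. The names crc, identityHolds and difference are likewise made up for illustration, and crc starts from ~seed to match the question's code.

```cpp
#include <cstdint>
#include <string>

// Assumed software model of _mm_crc32_u8: folds one byte into a CRC-32C
// state (reflected polynomial 0x82F63B78), with no initial or final inversion.
inline std::uint32_t crc32c_u8(std::uint32_t crc, std::uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
        // Shift right; xor in the polynomial if the bit shifted out was 1.
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

// CRC(seed, dat) as in the question: start from ~seed, then feed every byte.
inline std::uint32_t crc(std::uint32_t seed, const std::string & dat)
{
    std::uint32_t hash = ~seed;
    for (unsigned char ch : dat) {
        hash = crc32c_u8(hash, ch);
    }
    return hash;
}

// Checks CRC(seed2, dat1) == CRC(seed1, dat1) ^ CRC(seed1, dat2) ^ CRC(seed2, dat2).
// Only meaningful when dat1 and dat2 have the same length.
inline bool identityHolds(std::uint32_t seed1, std::uint32_t seed2,
                          const std::string & dat1, const std::string & dat2)
{
    return crc(seed2, dat1) == (crc(seed1, dat1) ^ crc(seed1, dat2) ^ crc(seed2, dat2));
}

// Xor of the CRCs of two equal-length strings; zero means they collide.
// By the identity, this value is the same for every seed.
inline std::uint32_t difference(std::uint32_t seed, const std::string & a, const std::string & b)
{
    return crc(seed, a) ^ crc(seed, b);
}
```

For any two equal-length strings, difference(seed, a, b) should come out the same whatever seed is passed. A pair that collides for one seed, such as the abfcmbk and baabaaa pair reported in the question, therefore collides for all of them.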

Mark Adler · Answer 2 · 2020-09-27T23:14:02.137

As noted in another answer, a CRC can't help with this. Instead you should simply encode your six or fewer lowercase letters as a base-26 number in a 32-bit integer, with an offset that depends on the length of the string. The sum of 26^n for n=0 to 6 is less than 2^32. Much less actually, as that can be encoded in 29 bits. Or as Peter Cordes commented, in 30 bits with six five-bit fields.

There will be no collisions. If it's useful, you can apply a 32-bit CRC to that integer to scramble the bits, and there will again be no collisions.

As you observed, it is not possible to uniquely encode seven or more lower-case characters in 32 bits.
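
One way the base-26 encoding with length offsets might look is sketched below. The function name encodeBase26 is hypothetical, and the sketch assumes the input has already been validated as zero to six characters in 'a'..'z'.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: maps a string of 0..6 lowercase letters to a unique
// 32-bit integer. Strings of length n occupy the range
// [offset(n), offset(n) + 26^n), where offset(n) = 26^0 + ... + 26^(n-1)
// is the number of strings shorter than n.
// Assumption: the caller guarantees the length and the alphabet.
inline std::uint32_t encodeBase26(const std::string & s)
{
    std::uint32_t offset = 0; // count of all strings shorter than s
    std::uint32_t power = 1;  // 26^i for the current position i
    std::uint32_t value = 0;  // s read as a base-26 number, 'a' = 0
    for (unsigned char ch : s) {
        offset += power;
        power *= 26;
        value = value * 26 + static_cast<std::uint32_t>(ch - 'a');
    }
    return offset + value;
}
```

With this layout the empty string maps to 0, "a" to 1, "z" to 26, "aa" to 27, and "zzzzzz" to 321272406, which is below 2^29.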

5 bits per letter x 6 = 30 bits: you can get 6 letters into one 32-bit integer with easy-to-unpack 5-bit fields. — Peter Cordes, Sep 27 '20 at 17:45
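
The five-bit-field variant could be sketched as follows. The names pack5 and unpack5 are hypothetical. Storing letters as 1..26 so that a zero field means "no letter" is an assumption of this sketch, made so that strings of different lengths stay distinct; the comment itself does not specify it. Input is assumed to be at most six characters in 'a'..'z'.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: packs up to six lowercase letters into 5-bit fields,
// first letter in the lowest bits. Letters are stored as 1..26 (assumption),
// so an all-zero field marks the end of a shorter string.
inline std::uint32_t pack5(const std::string & s)
{
    std::uint32_t packed = 0;
    for (std::size_t i = 0; i < s.size() && i < 6; ++i) {
        const std::uint32_t code = static_cast<std::uint32_t>(s[i] - 'a') + 1u;
        packed |= code << (5 * i);
    }
    return packed;
}

// Inverse of pack5: reads fields until the first empty one.
inline std::string unpack5(std::uint32_t packed)
{
    std::string s;
    for (int i = 0; i < 6; ++i) {
        const std::uint32_t code = (packed >> (5 * i)) & 31u;
        if (code == 0) {
            break; // empty field: no more letters
        }
        s.push_back(static_cast<char>('a' + code - 1));
    }
    return s;
}
```

Compared with the base-26 encoding this uses 30 bits rather than 29, but each letter can be extracted with a shift and a mask instead of divisions.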
