
I am trying to generate some crc32 hashes, but it seems like zlib and binascii use the crc32b algorithm even though their respective functions are simply zlib.crc32 and binascii.crc32. Are there any other python resources for hash generation that I can try? Interestingly, I have previously found that R's 'digest' package also implements crc32b with no mention of crc32.

Some examples of what I mean by CRC32 and CRC32b:

Here you can see both in the dropdown: http://www.md5calc.com/crc32

Here, CRC32b is on the right side: https://hash.online-convert.com/crc32-generator

Here is a PHP-centered discussion of the distinction: What is the difference between crc32 and crc32b?

Here we can see that Python is implementing CRC32b: How to calculate CRC32 with Python to match online results?

Thank you

Kelley Brady
  • Where are you finding CRCs that are called "crc32a" and "crc32b"? I don't see any CRCs by those names in this [pretty complete menagerie of CRCs](http://reveng.sourceforge.net/crc-catalogue/all.htm). zlib's CRC-32 is always referred to as simply CRC-32. No "a" or "b". – Mark Adler Jun 13 '18 at 16:52
  • Not sure how notifications work here, but I edited my question with these details – Kelley Brady Jun 13 '18 at 17:01

2 Answers


What they are calling "crc32" is the CRC-32/BZIP2 in this catalog. What they are calling "crc32b" is the PKZip CRC-32 (ITU V.42), commonly referred to as simply CRC-32, as it is in that catalog. This use of "crc32" and "crc32b" is apparently a notation invented by the PHP authors.

You can find a set of example hashes on the PHP documentation page for hash(). There the hashes of the string "hello" are calculated, and can be checked against implementations. The catalog I linked uses "123456789" for the checks.
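As a quick sanity check, the catalog's check input can be fed to Python's standard library. This is a minimal sketch, assuming Python 3; the variable names are only illustrative. The catalog lists 0xcbf43926 as the check value for CRC-32 (PHP's "crc32b") and 0xfc891918 for CRC-32/BZIP2 (PHP's "crc32"), so the output shows which of the two zlib and binascii implement.

```python
import binascii
import zlib

# The catalogue's standard check input.
CHECK_INPUT = b"123456789"

# Mask to 32 bits so the result is an unsigned value on any Python version.
zlib_crc = zlib.crc32(CHECK_INPUT) & 0xFFFFFFFF
binascii_crc = binascii.crc32(CHECK_INPUT) & 0xFFFFFFFF

print(f"zlib.crc32:     {zlib_crc:08x}")      # cbf43926
print(f"binascii.crc32: {binascii_crc:08x}")  # cbf43926
```

Both functions return the PKZip CRC-32, so neither can produce the BZIP2 variant directly.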

You can easily calculate the BZIP2 CRC yourself. Here is some C code as an example:

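#include <stddef.h>
#include <stdint.h>

// Update a running CRC-32/BZIP2 with len bytes at mem. Pass NULL to get the initial CRC.
uint32_t crc32bzip2(uint32_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0;
    crc = ~crc;                              // undo the final inversion of the previous call
    while (len--) {
        crc ^= (uint32_t)(*data++) << 24;    // feed the next byte into the top of the register
        for (unsigned k = 0; k < 8; k++)     // process bits most significant first (not reflected)
            crc = crc & 0x80000000 ? (crc << 1) ^ 0x04c11db7 : crc << 1;
    }
    crc = ~crc;                              // final inversion
    return crc;
}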

If you call that with NULL for the data pointer, it will return the initial value of the CRC, which in this case is zero. Then you can call it with the current CRC and the bytes to update the CRC with, and it will return the resulting CRC.
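The same calling convention can be mirrored in Python. The following is only a sketch: the name `crc32_bzip2_update` is hypothetical, and `None` stands in for the C `NULL` pointer. It shows that feeding the data in pieces gives the same result as one call over the whole input.

```python
from typing import Optional


def crc32_bzip2_update(crc: int, data: Optional[bytes]) -> int:
    """Return the CRC-32/BZIP2 of data, continuing from a previous crc.

    Passing data=None returns the initial CRC value (zero), as in the C code.
    """
    if data is None:
        return 0
    crc ^= 0xFFFFFFFF  # undo the final inversion of the previous call
    for byte in data:
        crc ^= byte << 24  # feed the byte into bits 24..31
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF  # final inversion


# Incremental use: start from the initial value, then feed chunks.
crc = crc32_bzip2_update(0, None)
for chunk in (b"1234", b"56789"):
    crc = crc32_bzip2_update(crc, chunk)

# One call over the whole input for comparison.
one_shot = crc32_bzip2_update(crc32_bzip2_update(0, None), b"123456789")

print(hex(crc), hex(one_shot))  # 0xfc891918 0xfc891918
```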

A Python version that computes the CRC-32/BZIP2 of the bytes from stdin:

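#!/usr/local/bin/python3
import sys

# Read all of stdin as raw bytes.
a = bytearray(sys.stdin.buffer.read())
crc = 0xffffffff  # initial register value
for x in a:
    crc ^= x << 24  # feed the next byte into bits 24..31
    for k in range(8):
        # Shift left one bit; XOR in the polynomial when bit 31 was set.
        crc = (crc << 1) ^ 0x04c11db7 if crc & 0x80000000 else crc << 1
crc = ~crc  # final inversion
crc &= 0xffffffff  # keep only the low 32 bits
print(hex(crc))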

crcany will generate more efficient table-based versions (in C) if desired.
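For readers who want to stay in Python, a table-driven variant can be sketched as follows. This is not crcany output; it is a hand-written illustration, and the names `TABLE`, `_build_table` and `crc32_bzip2_table` are made up for the example. It precomputes the effect of the eight bit steps for every possible top byte, so the main loop does one lookup per input byte.

```python
POLY = 0x04C11DB7
MASK = 0xFFFFFFFF


def _build_table():
    """Precompute the register after 8 shift steps for each top-byte value."""
    table = []
    for byte in range(256):
        reg = byte << 24
        for _ in range(8):
            if reg & 0x80000000:
                reg = ((reg << 1) ^ POLY) & MASK
            else:
                reg = (reg << 1) & MASK
        table.append(reg)
    return table


TABLE = _build_table()


def crc32_bzip2_table(data, crc=0):
    """CRC-32/BZIP2 of data, optionally continuing from a previous crc."""
    crc ^= MASK  # initial value / undo previous final inversion
    for byte in data:
        # Combine the incoming byte with the top byte, then look up 8 steps at once.
        crc = ((crc << 8) & MASK) ^ TABLE[(crc >> 24) ^ byte]
    return crc ^ MASK  # final inversion


print(hex(crc32_bzip2_table(b"123456789")))  # 0xfc891918
```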

Mark Adler
  • Essentially, I am working with an inventory system that uses PHP to generate 'crc32' (as PHP refers to it) hashes once a particular item has been inventoried in a database. The hash is then turned into a barcode and used to label the item. I would now like to preemptively generate hashes for items whose human-readable IDs I know (the human-readable IDs are too long for barcoding) but which have not yet been inventoried, for _reasons_. My PHP knowledge is limited, so I was hoping to use Python to skip the inventorying step and generate my hash barcodes directly from the human-readable IDs – Kelley Brady Jun 13 '18 at 17:56
  • I'm pretty sure md5calc is using the algorithm **CRC-32/BZIP2** on the page you sent http://reveng.sourceforge.net/crc-catalogue/all.htm . One of the aliases is **CRC-32/AAL5** – Kelley Brady Jun 13 '18 at 18:22
  • Yes, I checked and it is the BZIP2 CRC. – Mark Adler Jun 13 '18 at 22:19
  • Note that PHP crc32 output is in reverse byte order ([source](https://www.php.net/manual/en/function.hash-file.php#104836)). Here is a Python implementation based on @mark-adler's answer that reverses the output byte order: https://chezsoi.org/shaarli/?U7admg – Lucas Cimon Sep 04 '19 at 07:46
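
The byte-order reversal mentioned in the last comment can be sketched in a few lines. This is an illustration only, assuming the CRC is already available as a 32-bit integer; the helper name `reverse_byte_order` is hypothetical, and whether the result matches a given PHP installation should be verified against real PHP output.

```python
def reverse_byte_order(crc: int) -> str:
    """Format a 32-bit CRC as 8 hex digits with its four bytes reversed."""
    # Serialising as little-endian emits the least significant byte first.
    return crc.to_bytes(4, "little").hex()


# CRC-32/BZIP2 check value for b"123456789", taken from the catalogue.
bzip2_check = 0xFC891918

print(f"{bzip2_check:08x}")             # fc891918
print(reverse_byte_order(bzip2_check))  # 181989fc
```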

I made some improvements to Mark Adler's answer. It runs 20+ times faster once the data is split into blocks, but I don't know why.

#!/usr/local/bin/python3
import random
import timeit

def crc32_bzip2(data, precrc=None, bs=None):
    def crc32_bzip2_block(data, precrc=None):
        crc = 0xFFFFFFFF if precrc is None else (precrc ^ 0xFFFFFFFF)
        for x in data:
            crc ^= x << 24
            for k in range(8):
                if crc & 0x80000000:
                    crc = (crc << 1) ^ 0x04C11DB7
                else:
                    crc = crc << 1
        crc = ~crc
        crc &= 0xFFFFFFFF
        return crc

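    crc = precrc  # continue from a previous CRC if one was supplied
    bs = bs if bs else max(len(data), 1)  # default: one block; avoid a zero step on empty input
    blocks = [data[i:i+bs] for i in range(0, len(data), bs)]
    for b in blocks:
        crc = crc32_bzip2_block(b, crc)
    return 0 if crc is None else crc  # CRC of empty input with no prior CRC is 0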


# testing
bs = 512
datasize = 1024 * 50
data = bytearray(random.getrandbits(8) for _ in range(datasize))

number = 1
setup = 'from __main__ import crc32_bzip2, data, bs'
a = timeit.timeit('crc32_bzip2(data)', setup=setup, number=number)
b = timeit.timeit('crc32_bzip2(data, bs=bs)', setup=setup, number=number)

print(f'{a:.3}', f'{b:.3}', f'{a/b:.3}', sep='\t')
# 3.66  0.127   28.8, on the environment:
#    Intel i5-6300U CPU notebook
#    Python 3.6.6 64bit
#    Windows 7 SP1 64bit
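
A likely explanation for the speedup, offered as a sketch rather than a measured result: Python integers have arbitrary precision, and the inner loop never masks `crc` to 32 bits, so the register grows by 8 bits for every input byte and each shift and XOR gets slower as the input gets longer. Splitting into blocks masks the value at the end of every block, which keeps it small. Masking inside the loop should have the same effect without any partitioning. The names `crc32_bzip2_masked` and `unmasked_register_bits` below are hypothetical.

```python
def crc32_bzip2_masked(data, precrc=None):
    """CRC-32/BZIP2 with the register kept to 32 bits at every step."""
    crc = 0xFFFFFFFF if precrc is None else precrc ^ 0xFFFFFFFF
    for x in data:
        crc ^= x << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF


def unmasked_register_bits(data):
    """Run the unmasked loop and report how many bits the register ends up using."""
    crc = 0xFFFFFFFF
    for x in data:
        crc ^= x << 24
        for _ in range(8):
            crc = (crc << 1) ^ 0x04C11DB7 if crc & 0x80000000 else crc << 1
    return crc.bit_length()


print(hex(crc32_bzip2_masked(b"123456789")))  # 0xfc891918
print(unmasked_register_bits(bytes(1000)))    # 8032 bits after 1000 zero bytes
```

Timing `crc32_bzip2_masked(data)` against the partitioned call with `timeit` would show whether this accounts for the difference on a given machine.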
Keelung