
I am looking for a checksum algorithm where, for a large block of data, the checksum is equal to the sum of the checksums of all the smaller component blocks. Most of what I have found is in RFCs 1624 and 1141, which do provide this functionality. Does anyone have experience with these or similar checksumming techniques?

Steve Severance
  • Does it need to specifically be equal to the arithmetic sum of the checksums of the smaller blocks, or do you more generally just want to be able to calculate the checksum of the large block from the checksums of the smaller blocks? – John Kugelman Jul 23 '09 at 18:08
  • It seems to me, the checksumming problem is today considered "mostly solved" and often dismissed as "IO bound" and hence not interesting from an algorithmic performance point of view. But OK, "IO bound" it is, what can we do about IO? Well, if we could calculate hashes while IO is still hot in the caches, that would be nice. – Prof. Falken Sep 15 '10 at 15:21
  • @Amigable Clark Kent Perhaps it would have been better to open a new question with your exact requirements, instead of piggy-backing off an existing, answered one. Are you just looking for a checksum to detect errors? Are you looking for a cryptographic hash function? Is it required that the checksum be composed of some combination of the checksums of each block, as the original question asks, or is it only required that you can compute the checksum incrementally on a data stream and give a checksum for the whole stream once you're done? – Brian Campbell Sep 21 '10 at 21:59
  • @Brian Campbell, I am more after a hash thing, for file identity purposes. To me and my current project, the most important thing is that the checksum can be computed incrementally. But I thought the original question was very intriguing in its own right, and I hoped other people with lots of clues would write something interesting about it. :-) – Prof. Falken Sep 22 '10 at 12:57
  • @Amigable Clark Kent Looks like you awarded your bounty before I managed to finish posting my answer. Hope I could help you out anyhow. – Brian Campbell Sep 22 '10 at 14:59
  • @Amigable Clark Kent: Please open up your own question with precise requirements. Nearly every hash can be computed "incrementally", as most of them are a form of iterated hash function. The quasi-standard hash function API is made up of the functions `reset`, `update` and `finalize`, where `update` accepts a new chunk of data and updates the current state. Requirements: How many bits? 32? 128? Cryptographically secure, yes or no? etc etc etc – sellibitze Sep 24 '10 at 08:56

3 Answers


If it's just a matter of quickly combining the checksums of the smaller blocks to get the checksum of the larger message (not necessarily by a plain summation), you can do this with a CRC-type (or similar) algorithm.

The CRC-32 algorithm is as simple as this:

#include <stdint.h>

/* Feed one message bit into the running CRC-32 state
   (non-reflected form, generator polynomial 0x04C11DB7). */
uint32_t update(uint32_t state, unsigned bit)
{
    if (((state >> 31) ^ bit) & 1) state = (state << 1) ^ 0x04C11DB7;
    else                           state = (state << 1);
    return state;
}
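
For illustration, here is a tiny driver for that update function (the helper name crc32_bits is mine, not part of any standard API). Note that it starts from a zero state; the widely deployed CRC-32 variant additionally uses an initial value of 0xFFFFFFFF, bit reflection, and a final XOR:

#include <stddef.h>
#include <stdint.h>

/* Run update() over a whole buffer, feeding bits MSB first,
   starting from the all-zero state. */
uint32_t crc32_bits(const unsigned char *data, size_t len)
{
    uint32_t state = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--)
            state = update(state, (data[i] >> b) & 1);
    return state;
}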

Mathematically, the state represents a polynomial over the field GF(2) that is always kept reduced modulo the generator polynomial. Given a new bit b, the old state is transformed into the new state like this:

state --> (state * x^1 + b * x^32) mod G

where G is the generator polynomial and addition is done in GF(2) (XOR). This checksum is linear in the sense that you can write the message M as a sum (XOR) of messages A, B, C, ... like this:

  10110010 00000000 00000000 = A =    a     00000000 00000000
  00000000 10010001 00000000 = B = 00000000    b     00000000
  00000000 00000000 11000101 = C = 00000000 00000000    c
-------------------------------------------------------------
= 10110010 10010001 11000101 = M =    a        b        c

with the following properties

         M  =          A  +          B  +          C
checksum(M) = checksum(A) + checksum(B) + checksum(C)

Again, the + here means addition in GF(2), which you can implement with a bitwise XOR.
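
As a quick sanity check of that property, here is a sketch reusing update() and the crc32_bits() helper from above on the three-byte example:

#include <assert.h>

void check_linearity(void)
{
    /* the bytes of A, B, C and M from the diagram above */
    const unsigned char A[3] = {0xB2, 0x00, 0x00};
    const unsigned char B[3] = {0x00, 0x91, 0x00};
    const unsigned char C[3] = {0x00, 0x00, 0xC5};
    const unsigned char M[3] = {0xB2, 0x91, 0xC5};

    /* checksum(M) == checksum(A) + checksum(B) + checksum(C) in GF(2) */
    assert(crc32_bits(M, 3) ==
           (crc32_bits(A, 3) ^ crc32_bits(B, 3) ^ crc32_bits(C, 3)));
}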

Finally, it's possible to compute checksum(B) from checksum(b) and the position of the sub-block b within B. The easy part is leading zeros: they don't affect the checksum at all, so checksum(0000xxxx) is the same as checksum(xxxx). (This holds for the zero-initialized CRC shown above; the common CRC-32 variant with a 0xFFFFFFFF initial value and final XOR needs an extra adjustment.) Computing the checksum of a message zero-padded to the right (trailing zeros), given the checksum of the non-padded message, is a bit more complicated. But not that complicated:

zero_pad(old_check_sum, number_of_zeros)
  := ( old_check_sum *  x^{number_of_zeros}        ) mod G
   = ( old_check_sum * (x^{number_of_zeros} mod G) ) mod G

So, getting the checksum of a zero-padded message is just a matter of multiplying the "checksum polynomial" of the non-padded message by another polynomial (x^{number_of_zeros} mod G) that depends only on the number of zeros you want to append. You could precompute these factors in a table or use the square-and-multiply algorithm to compute the power quickly.
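
Here is a sketch of that computation in the same polynomial representation as update() (the helper names times_x, gf2_mul_mod_g and x_pow_mod_g, and the zero_pad signature, are mine):

#include <stdint.h>

/* Multiply the state polynomial by x modulo G (same reduction as update()). */
static uint32_t times_x(uint32_t p)
{
    uint32_t top = p >> 31;
    p <<= 1;
    if (top) p ^= 0x04C11DB7;
    return p;
}

/* Carry-less multiplication of two polynomials, reduced modulo G. */
static uint32_t gf2_mul_mod_g(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 31; i >= 0; i--) {
        r = times_x(r);               /* shift the partial product       */
        if ((b >> i) & 1) r ^= a;     /* add a if this bit of b is set   */
    }
    return r;
}

/* x^n mod G by square-and-multiply; 0x1 is the polynomial "1", 0x2 is "x". */
static uint32_t x_pow_mod_g(uint64_t n)
{
    uint32_t result = 0x1, base = 0x2;
    while (n) {
        if (n & 1) result = gf2_mul_mod_g(result, base);
        base = gf2_mul_mod_g(base, base);
        n >>= 1;
    }
    return result;
}

/* zero_pad from the formula above: append n zero bits to the message. */
uint32_t zero_pad(uint32_t old_check_sum, uint64_t number_of_zeros)
{
    return gf2_mul_mod_g(old_check_sum, x_pow_mod_g(number_of_zeros));
}

With zero_pad in hand, combining blocks is cheap: for two concatenated blocks A and B, checksum(A ++ B) = zero_pad(checksum(A), bit_length(B)) XOR checksum(B). This is essentially what zlib's crc32_combine() does, albeit for the reflected CRC-32 variant.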

Suggested reading: Ross Williams's "A Painless Guide to CRC Error Detection Algorithms"

sellibitze

I have only used Adler/Fletcher checksums, which work as you describe.

There is a nice comparison of Crypto++ hash/checksum implementations here.
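
For reference, a minimal Adler-32 sketch (the helper name adler32_update is mine). The usual way to checksum a stream incrementally is to chain the returned state through successive calls, seeded with 1, rather than to add separate digests together:

#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u   /* largest prime below 2^16 */

/* Streaming Adler-32: pass the previous return value back in (seed with 1). */
uint32_t adler32_update(uint32_t adler, const unsigned char *data, size_t len)
{
    uint32_t a = adler & 0xFFFFu;   /* running byte sum    */
    uint32_t b = adler >> 16;       /* running sum of sums */
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

So adler32_update(adler32_update(1, "wiki", 4), "pedia", 5) yields the same digest as hashing "wikipedia" in one call; the per-block digests are not simply summed (see the comment below). Combining two independent digests is still possible if you also know the second block's length, which is what zlib's adler32_combine() implements.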

Rob Elliott
  • I've been trying to play with Adler and I can't seem to get the expected result, do you mean: `adler('wiki') + adler('pedia')` should produce the same digest as `adler('wikipedia')`, or am I missing something? – Alix Axel Sep 19 '12 at 15:19

To answer Amigable Clark Kent's bounty question: for file identity purposes you probably want a cryptographic hash function, which tries to guarantee that any two given files have an extremely low probability of producing the same value. A checksum, by contrast, is generally used only for error detection and may well produce the same value for two very different files.

Many cryptographic hash functions, such as MD5 and SHA-1, use the Merkle–Damgård construction: a compression function folds each block of data into a fixed-size value, which is then combined with the fixed-size value from the previous block (or with an initialization vector for the first block). Thus they can work in a streaming mode, computing the hash incrementally as the data goes by.
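
To illustrate that streaming mode, here is a sketch using OpenSSL's EVP interface (an assumption on my part; any reset/update/finalize-style hashing API looks much the same; link with -lcrypto):

#include <openssl/evp.h>
#include <stdio.h>

int main(void)
{
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len = 0;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    /* feed the data in arbitrary chunks as it streams in */
    EVP_DigestUpdate(ctx, "wiki", 4);
    EVP_DigestUpdate(ctx, "pedia", 5);  /* same digest as hashing "wikipedia" at once */

    EVP_DigestFinal_ex(ctx, digest, &digest_len);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < digest_len; i++)
        printf("%02x", digest[i]);
    printf("\n");
    return 0;
}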

Brian Campbell