Differences in calculation of adler32 rolling checksum - python

Question

I need some clarification on calculating a running checksum.

Assume I have data like this.

data = 'helloworld'

Assuming a block size of 5, I need to calculate the running checksum.

>>> import zlib
>>> zlib.adler32('hello')
103547413
>>> zlib.adler32('ellow')
105316900

According to the Python documentation (Python 2.7.2):

zlib.adler32(data[, value])

"Computes a Adler-32 checksum of data. (An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly.) If value is present, it is used as the starting value of the checksum; otherwise, a fixed default value is used. This allows computing a running checksum over the concatenation of several inputs."

But when I provide something like this,

>>> zlib.adler32('ellow', zlib.adler32('hello'))
383190072

The output is entirely different.

I tried creating a custom function to generate the rolling checksum as defined in the rsync algorithm.

def weakchecksum(data):
    a = 1
    b = 0

    for char in data:
        a += (ord(char)) % MOD_VALUE
        b += a % MOD_VALUE

    return (b << 16) | a

def rolling(checksum, removed, added, block_size):
    a = checksum
    b = (a >> 16) & 0xffff
    a &= 0xffff

    a = (a - ord(removed) + ord(added)) % MOD_VALUE
    b = (b - (block_size * ord(removed)) + a) % MOD_VALUE

    return (b << 16) | a

Here are the values that I get from running these functions:

Weak for hello: 103547413
Rolling for ellow: 105382436
Weak for ellow: 105316900

As you can see, the value from my rolling checksum differs from the one Python computes for the same block.

Where am I going wrong in calculating the rolling checksum? Am I making use of the rolling property of Python's adler32 function correctly?

score 8 · Accepted Answer · edited Aug 03 '12 at 15:04

The adler32() function does not provide "rolling". The documentation correctly uses the word "running" (not "rolling"), which simply means that it can compute the adler32 in chunks as opposed to all at once. You need to write your own code to compute a "rolling" adler32 value, which would be the adler32 of a sliding window over the data.
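
To make the distinction concrete, here is a small sketch. It assumes Python 3, so the inputs are bytes literals rather than the Python 2 strings used in the question. Passing a previous checksum as the second argument continues the checksum over the concatenation of the inputs; it does not slide a window.

```python
import zlib

# "Running": feeding chunks one after another gives the same result as
# checksumming the concatenation of those chunks.
running = zlib.adler32(b"world", zlib.adler32(b"hello"))
assert running == zlib.adler32(b"helloworld")

# This is why the chained call in the question looks "entirely different":
# it is the checksum of b"hello" + b"ellow", not of the window b"ellow".
chained = zlib.adler32(b"ellow", zlib.adler32(b"hello"))
assert chained == zlib.adler32(b"helloellow")
assert chained != zlib.adler32(b"ellow")

print(chained)  # 383190072
```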

edited Aug 03 '12 at 15:04

Martin Thompson

answered Mar 14 '12 at 19:17

Mark Adler

JasonDong · Answer 2 · 2013-11-27T01:59:12.897

In your "rolling" function, the line

b = (b - (block_size * ord(removed)) + a) % MOD_VALUE

should be

b = (b - (block_size * ord(removed)) + a - 1) % MOD_VALUE

According to the description of the Adler-32 algorithm on Wikipedia:

A = 1 + D1 + D2 + ... + Dn (mod 65521)
B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
  = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521)

Adler-32(D) = B × 65536 + A

When the window rolls forward by one byte (D1 leaves, Dn+1 enters), the new sums A1 and B1 are:

A1 = 1 + D2 + D3 + ... + Dn + Dn+1 (mod 65521)
   = (1 + D1 + D2 + ... + Dn) − D1 + Dn+1 (mod 65521)
   = A − D1 + Dn+1 (mod 65521)

B1 = (1 + D2) + (1 + D2 + D3) + ... + (1 + D2 + D3 + ... + Dn + Dn+1) (mod 65521)
   = [(1 + D1 + D2) − D1] + ... + [(1 + D1 + ... + Dn) − D1] + [(1 + D1 + ... + Dn + Dn+1) − D1] (mod 65521)
   = [B − (1 + D1)] + (A1 + D1) − n×D1 (mod 65521)
   = B − n×D1 + A1 − 1 (mod 65521)
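
As a sanity check on the derivation, the following sketch applies the corrected update to every 5-byte window of the question's data and compares each result with zlib.adler32. It is written for Python 3 (iterating over bytes yields integers, so no ord() is needed). The names MOD, adler32_block and adler32_roll are illustrative and not part of any library.

```python
import zlib

MOD = 65521  # the Adler-32 modulus


def adler32_block(block):
    """Adler-32 of a whole block, computed from scratch."""
    a, b = 1, 0
    for byte in block:
        a = (a + byte) % MOD
        b = (b + a) % MOD
    return (b << 16) | a


def adler32_roll(checksum, removed, added, n):
    """Slide an n-byte window forward by one byte."""
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    a = (a - removed + added) % MOD
    # The "- 1" accounts for the constant 1 that starts the A sum.
    b = (b - n * removed + a - 1) % MOD
    return (b << 16) | a


data = b"helloworld"
n = 5
checksum = adler32_block(data[:n])
for i in range(1, len(data) - n + 1):
    # data[i - 1] leaves the window, data[i + n - 1] enters it.
    checksum = adler32_roll(checksum, data[i - 1], data[i + n - 1], n)
    assert checksum == zlib.adler32(data[i:i + n])
```

With the original update (without the "- 1"), the first step yields 105382436, the value reported in the question, which is exactly 65536 more than the expected 105316900: B is off by one.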

score 4 · Answer 3 · answered Mar 15 '12 at 18:32

By the way, your def rolling() is correct, at least for Python where the sign of the modulo result has the sign of the divisor. It might not work in other languages, where for example in C the sign of the result of % is either the sign of the dividend or is implementation defined.
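
The difference in sign conventions can be seen from Python itself. In this sketch, math.fmod stands in for a C-style truncating remainder (it returns a float with the sign of the dividend); the helper mod_floor and the sample value -520 are made up for illustration.

```python
import math

MOD = 65521

# Python's % takes the sign of the divisor, so a negative intermediate
# value is folded back into the range 0..MOD-1.
assert (-520) % MOD == 65001

# A truncating remainder keeps the sign of the dividend instead;
# math.fmod behaves that way.
assert math.fmod(-520, MOD) == -520.0


def mod_floor(x, m):
    """Emulate Python's % (for m > 0) on top of a truncating remainder."""
    r = math.fmod(x, m)
    return int(r + m) if r < 0 else int(r)


assert mod_floor(-520, MOD) == (-520) % MOD
```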

You can make your algorithm more efficient by considering how far from the range 0..65520 the sums can stray at each step. Then either replace the % with ifs that add or subtract 65521, or use data types large enough to let the sums accumulate for a while, and work out how infrequently you can apply % without overflowing. Again, be careful with % on negative dividends.
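
One possible reading of the first suggestion, as a sketch: if the block size is at most 256, then block_size * removed is at most 65280, which is below 65521, so each sum strays from the valid range by less than one modulus and a single conditional add or subtract brings it back. The function name roll_no_mod and the 256-byte bound are assumptions made for this example (Python 3, bytes input).

```python
import zlib

MOD = 65521


def roll_no_mod(a, b, removed, added, n):
    """Rolling Adler-32 update without %, assuming n <= 256 and 0 <= a, b < MOD."""
    assert n <= 256  # keeps n * removed below MOD for byte values 0..255
    a += added - removed      # now in -255 .. MOD + 254
    if a < 0:
        a += MOD
    elif a >= MOD:
        a -= MOD
    b += a - 1 - n * removed  # now in -65281 .. 2*MOD - 3
    if b < 0:
        b += MOD
    elif b >= MOD:
        b -= MOD
    return a, b


# Check every 256-byte window of some sample data against zlib.
data = bytes(range(256)) * 3 + b"helloworld"
n = 256
first = zlib.adler32(data[:n])
a, b = first & 0xFFFF, first >> 16
for i in range(1, len(data) - n + 1):
    a, b = roll_no_mod(a, b, data[i - 1], data[i + n - 1], n)
    assert (b << 16) | a == zlib.adler32(data[i:i + n])
```

For larger blocks, n * removed would have to be reduced first, which is where a % (or a wider integer type) comes back in.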

answered Mar 15 '12 at 18:32

Mark Adler

Thanks for your additional comments, Mark. – prabhu Apr 08 '12 at 05:41
I tried with prime 65521 and got calculation errors in my rolling checksum procedure implementation (the change was or wasn't detected). Everything is fine if I use 2^16. I hope I will be able to come back to this problem some time later and exclude the possibility of programming error bringing some useful information on the topic in the same time. – 4pie0 Mar 25 '17 at 09:50

score 1 · Answer 4 · answered Oct 20 '13 at 12:59

Here is the working function. Please notice at what step the MOD is calculated.

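MOD_ADLER = 65521

def myadler32(data):
    # data must yield integers when iterated (e.g. Python 3 bytes or a bytearray)
    a = 1
    b = 0
    for c in data:
        a += c
        b += a
    # Python integers do not overflow, so reducing once at the end is enough
    a %= MOD_ADLER
    b %= MOD_ADLER
    return b << 16 | a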
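
Deferring the modulo to the very end is safe in Python because its integers never overflow. With fixed-width integers the sums have to be reduced periodically; zlib's C implementation, for instance, reduces every 5552 bytes (its NMAX constant) so that 32-bit unsigned sums cannot overflow. A sketch of that idea in Python 3, where adler32_chunked is a made-up name:

```python
import zlib

MOD_ADLER = 65521
NMAX = 5552  # bytes that can be summed before a 32-bit unsigned sum could overflow


def adler32_chunked(data):
    a, b = 1, 0
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte
            b += a
        # Reduce once per chunk rather than once per byte.
        a %= MOD_ADLER
        b %= MOD_ADLER
    return (b << 16) | a


sample = bytes(range(256)) * 100  # 25,600 bytes, so several chunks
assert adler32_chunked(sample) == zlib.adler32(sample)
assert adler32_chunked(b"hello") == 103547413
```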

answered Oct 20 '13 at 12:59

Alexandre Kandalintsev

score 0 · Answer 5 · answered Mar 14 '12 at 09:45

I believe you've mis-calculated the adler32 value in your testing:

>>> import zlib
>>> zlib.adler32("helloworld")
389415997
>>> zlib.adler32("world",zlib.adler32("hello"))
389415997

answered Mar 14 '12 at 09:45

sarnold

Thanks. But, I guess I'm looking for differences in case of rolling checksums. In your case, what I get is the checksum of 'world', and what I'm interested is, calculating the checksum of 'ellow' using the checksum of 'hello'. The difference between the two being that 'h' is removed and 'w' is added. Let me know if I'm not clear. – Liju Mathew Mar 14 '12 at 12:14
