I am using adler32 function from zlib to calculate the weak checksum of a chunk of memory x (4096 in length). Everything is fine, but now I would like to perform the rolling checksum if the chunks from different file do not match. However, I am not sure how to write a function to perform that on the value returned by adler32 in zlib. So if the checksum does not match, how do I calculate rolling checksum by using original checksum, x + 1 byte and x + 4096 + 1? Basically trying to build rsync implementation.
Asked
Active
Viewed 2,503 times
1 Answers
6
Pysync has implemented rolling on top of zlib's Adler32 like this:
_BASE=65521 # largest prime smaller than 65536
_NMAX=5552 # largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1
_OFFS=1 # default initial s1 offset
import zlib
class adler32:
def __init__(self,data=''):
value = zlib.adler32(data,_OFFS)
self.s2, self.s1 = (value >> 16) & 0xffff, value & 0xffff
self.count=len(data)
def update(self,data):
value = zlib.adler32(data, (self.s2<<16) | self.s1)
self.s2, self.s1 = (value >> 16) & 0xffff, value & 0xffff
self.count = self.count+len(data)
def rotate(self,x1,xn):
x1,xn=ord(x1),ord(xn)
self.s1=(self.s1 - x1 + xn) % _BASE
self.s2=(self.s2 - self.count*x1 + self.s1 - _OFFS) % _BASE
def digest(self):
return (self.s2<<16) | self.s1
def copy(self):
n=adler32()
n.count,n.s1,n.s2=self.count,self.s1,self.s2
return n
But as Peter stated, rsync does not use Adler32 directly, but a faster variant of it.
Code of the rsync tool is bit hard to read, but checkout librsync. It's a completely separate project and it's much more readable. Take a look at rollsum.c
and rollsum.h
. There is an efficient implementation of the variant in C macros:
/* the Rollsum struct type*/
typedef struct _Rollsum {
unsigned long count; /* count of bytes included in sum */
unsigned long s1; /* s1 part of sum */
unsigned long s2; /* s2 part of sum */
} Rollsum;
#define ROLLSUM_CHAR_OFFSET 31
#define RollsumInit(sum) { \
(sum)->count=(sum)->s1=(sum)->s2=0; \
}
#define RollsumRotate(sum,out,in) { \
(sum)->s1 += (unsigned char)(in) - (unsigned char)(out); \
(sum)->s2 += (sum)->s1 - (sum)->count*((unsigned char)(out)+ROLLSUM_CHAR_OFFSET); \
}
#define RollsumRollin(sum,c) { \
(sum)->s1 += ((unsigned char)(c)+ROLLSUM_CHAR_OFFSET); \
(sum)->s2 += (sum)->s1; \
(sum)->count++; \
}
#define RollsumRollout(sum,c) { \
(sum)->s1 -= ((unsigned char)(c)+ROLLSUM_CHAR_OFFSET); \
(sum)->s2 -= (sum)->count*((unsigned char)(c)+ROLLSUM_CHAR_OFFSET); \
(sum)->count--; \
}
#define RollsumDigest(sum) (((sum)->s2 << 16) | ((sum)->s1 & 0xffff))

esamatti
- 18,293
- 11
- 75
- 82
-
And what is ROLLSUM_CHAR_OFFSET in this example and why is it 31? There is nothing like this in hash description. – Alexander Shishenko Jan 10 '15 at 20:26
-
It looks like ROLLSUM_CHAR_OFFSET is just an arbitrary value that's added to every input byte when calculating the sum. While this would "fill up" the range of the checksum quicker, it probably serves no actual purpose because, if two different inputs of the same length have a checksum collision, the offset will not change this at all. – mwfearnley Jul 29 '20 at 10:15