I have a large file and I maintain a CRC-32 checksum over its contents. If a fixed portion of the file changes, either at the start or at the end of the file, I can maintain separate CRC-32 checksums for the static portion and the dynamic portion and use crc32_combine to efficiently calculate the new whole-file checksum. Mark Adler answered that beautifully here: CRC Calculation Of A Mostly Static Data Stream.

But if content in the middle of the file changes, and not always at a predefined offset and length, is there a way to efficiently compute the whole-file checksum without reading the whole file?

  • You could have a table of CRCs for all "sectors" of your file, and then use the combination strategy suggested by Mark. (where "sectors" can be any unit of size you choose - of course, the smaller the size, the more you will have to compute to combine the result!) – guga Aug 27 '19 at 21:36
  • Thank you Guga. Yes, I can maintain a list or a tree of the CRCs and use crc32_combine. But I was wondering if this could be done more simply. – Nitin Muppalaneni Aug 28 '19 at 16:10

1 Answer

Yes, so long as you know the before and after values of the changed bytes, and their location, of course.

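Compute the exclusive-or of the before and after contents. That is zero where there are no changes, and non-zero where there are changes. Then compute the raw CRC (zero initial value, no final exclusive-or) of that exclusive-or over the entire length of the file, and exclusive-or the result with the original CRC. This works because the raw CRC is linear over exclusive-or, and the pre- and post-conditioning of the two equal-length CRCs cancel out.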
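As a minimal sketch of that identity in Python, assuming the standard `zlib.crc32` and two hypothetical helper names (`raw_crc32`, `crc32_after_xor`) that are not part of any library: this version still runs over a full-length difference buffer, so it only demonstrates the property and is not yet the fast path.

```python
import zlib

def raw_crc32(data: bytes) -> int:
    """CRC-32 register with zero initial value and no final inversion."""
    # zlib.crc32 inverts its starting value on entry and its result on exit,
    # so passing 0xFFFFFFFF and undoing the final inversion yields the raw CRC.
    return zlib.crc32(data, 0xFFFFFFFF) ^ 0xFFFFFFFF

def crc32_after_xor(old_crc: int, delta: bytes) -> int:
    """New CRC-32, given the old CRC-32 and a full-length XOR difference."""
    return old_crc ^ raw_crc32(delta)

# Hypothetical data: a 2 KiB "file" with three bytes changed in the middle.
old = bytes(range(256)) * 8
new = bytearray(old)
new[1000:1003] = b"abc"
delta = bytes(a ^ b for a, b in zip(old, new))  # zeros except at the change
assert crc32_after_xor(zlib.crc32(old), delta) == zlib.crc32(bytes(new))
```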

Presumably you will have a long sequence of zeros, then some non-zero values, and then another long sequence of zeros. You can ignore the initial long sequence, since a raw CRC that starts at zero stays at zero while it is fed zeros, and just start computing the raw CRC at the non-zero values. Then use the same trick as in the answer linked in the question to apply the long sequence of trailing zeros to that raw CRC.
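Here is a rough Python sketch of the whole procedure. Python's `zlib` module does not expose `crc32_combine`, so this sketch applies the trailing zeros by hand: feeding n zero bytes to a raw CRC is the same as multiplying it by x^(8n) modulo the CRC polynomial, done here with square-and-multiply. The names `multmodp`, `x_pow`, and `patch_crc32` are mine and purely illustrative, and the code assumes the standard reflected CRC-32 that `zlib.crc32` computes.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial (the one zlib.crc32 uses)

def multmodp(a: int, b: int) -> int:
    """Multiply a by b modulo POLY over GF(2); bit 31 holds the x^0 term."""
    p = 0
    for i in range(32):
        if a & (1 << (31 - i)):
            p ^= b
        b = (b >> 1) ^ POLY if b & 1 else b >> 1  # b *= x
    return p

def x_pow(n: int) -> int:
    """x^n modulo POLY, by repeated squaring."""
    result, base = 1 << 31, 1 << 30  # x^0 and x^1
    while n:
        if n & 1:
            result = multmodp(base, result)
        base = multmodp(base, base)
        n >>= 1
    return result

def patch_crc32(old_crc: int, file_len: int, offset: int,
                before: bytes, after: bytes) -> int:
    """CRC-32 of the file after `before` at `offset` is replaced by `after`."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    delta = bytes(x ^ y for x, y in zip(before, after))
    # Raw CRC of the changed span only; the leading zeros contribute nothing.
    raw = zlib.crc32(delta, 0xFFFFFFFF) ^ 0xFFFFFFFF
    # Apply the trailing zero bytes in O(log n) multiplications.
    trailing = file_len - (offset + len(delta))
    raw = multmodp(x_pow(8 * trailing), raw)
    return old_crc ^ raw
```

The cost is proportional to the number of changed bytes plus the logarithm of the distance to the end of the file, so the unchanged parts of the file never need to be read.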

Mark Adler