
The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be the detection of transmission errors. As such, most studies I have seen address the probability of detecting small-scale differences between two data sets. My needs are slightly different.

What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward off answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.

I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
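For illustration only, here is a rough sketch of the kind of exchange I mean (all names here are invented for the sketch, not taken from my actual implementation): the sender announces (size, UID) and the receiver requests the payload only when it has nothing cached under that pair.

```c
/* Hypothetical sketch (illustrative names only): the sender announces
 * (size, UID) and the receiver asks for the payload only when it has
 * nothing cached under that (size, UID) pair. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t size;  /* serialized size of the data set */
    uint32_t uid;   /* 32-bit CRC of the serialized data set */
} announce_t;

/* Toy local cache; a real implementation would use a hash table or database. */
static announce_t cache[1024];
static size_t cache_count = 0;

static bool cache_contains(announce_t a)
{
    for (size_t i = 0; i < cache_count; i++)
        if (cache[i].size == a.size && cache[i].uid == a.uid)
            return true;
    return false;
}

int main(void)
{
    cache[cache_count++] = (announce_t){ .size = 1048576, .uid = 0xDEADBEEFu };

    announce_t incoming = { .size = 1048576, .uid = 0xDEADBEEFu };
    puts(cache_contains(incoming) ? "already cached, skip transfer"
                                  : "request full transmission");
    return 0;
}
```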

This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.

I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.

I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
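For concreteness, a minimal sketch of this scheme using zlib's crc32() (illustrative only, not necessarily how my actual implementation does it; link with -lz):

```c
/* A minimal sketch of the current scheme: the UID is the standard CRC-32 of
 * the serialized bytes, here computed with zlib's crc32(). Link with -lz. */
#include <stddef.h>
#include <stdio.h>
#include <zlib.h>

static unsigned long dataset_uid(const unsigned char *serialized, size_t len)
{
    uLong crc = crc32(0L, Z_NULL, 0);         /* CRC-32 initial value */
    return crc32(crc, serialized, (uInt)len); /* run over the serialized bytes */
}

int main(void)
{
    const unsigned char data[] = "serialized representation of a ~1MB data set";
    printf("UID = %08lx\n", dataset_uid(data, sizeof(data) - 1));
    return 0;
}
```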

The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRCs the result. However, most of the time I already have the CRCs of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRCs of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?

To put this in language that will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:

UID_1(T) = CRC(R & S1 & S2 & ... & Sn)

and my new scheme can be thought of as

UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
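For concreteness, here is one way UID_2 could be computed with zlib, assuming each CRC(Si) is folded in as 4 big-endian bytes (the byte order is an assumption of the sketch, not something I have specified):

```c
/* A sketch of UID_2: run one CRC over the raw data R, then over the 4-byte
 * CRC of each subordinate data set (big-endian byte order assumed here).
 * Uses zlib; link with -lz. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

static unsigned long uid2(const unsigned char *raw, size_t raw_len,
                          const uint32_t *sub_crcs, size_t n)
{
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, raw, (uInt)raw_len);               /* R */
    for (size_t i = 0; i < n; i++) {
        unsigned char b[4] = {                          /* CRC(Si) as 4 bytes */
            (unsigned char)(sub_crcs[i] >> 24), (unsigned char)(sub_crcs[i] >> 16),
            (unsigned char)(sub_crcs[i] >> 8),  (unsigned char)(sub_crcs[i])
        };
        crc = crc32(crc, b, 4);
    }
    return crc;
}

int main(void)
{
    const unsigned char raw[] = "raw data R";
    uint32_t sub_crcs[2] = { 0x12345678u, 0x9abcdef0u };  /* placeholder CRC(S1), CRC(S2) */
    printf("UID_2(T) = %08lx\n", uid2(raw, sizeof(raw) - 1, sub_crcs, 2));
    return 0;
}
```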

Then my question is: if T and T' are very different, which CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), which CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?

My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,

prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * probability of the CRC not detecting an error of a few bits.

Let's call the probability of the CRC not detecting an error of a few bits P, and the probability of it not detecting large differences in a large data set Q. The above can be written approximately as

prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P

Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define

UID_3(T) = CRC(UID_3(S1) & UID_3(S2) & ... & UID_3(Sn))
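For illustration, a rough sketch of how UID_3 could be computed recursively (the node layout and helper names are illustrative, not part of my actual implementation; a fundamental piece contributes its raw bytes, a composite data set contributes its 4-byte UID_3):

```c
/* A rough sketch of the recursive UID_3: a fundamental piece contributes its
 * raw bytes, a composite data set contributes its 4-byte UID_3. The node
 * layout and helper names are illustrative. Uses zlib; link with -lz. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

typedef struct node {
    const unsigned char *raw;     /* non-NULL for a "fundamental" piece of data */
    size_t raw_len;
    const struct node *children;  /* non-NULL for a composite data set */
    size_t n_children;
} node_t;

static uLong feed_uid3(uLong crc, const node_t *t);

/* UID_3 of a composite data set: CRC over the concatenated child identifiers. */
static uint32_t uid3(const node_t *t)
{
    uLong crc = crc32(0L, Z_NULL, 0);
    for (size_t i = 0; i < t->n_children; i++)
        crc = feed_uid3(crc, &t->children[i]);
    return (uint32_t)crc;
}

/* Feed one child's identifier into a running CRC: raw bytes for a fundamental
 * piece, the 4-byte UID_3 for a composite one. */
static uLong feed_uid3(uLong crc, const node_t *t)
{
    if (t->raw != NULL)
        return crc32(crc, t->raw, (uInt)t->raw_len);
    uint32_t u = uid3(t);
    unsigned char b[4] = { (unsigned char)(u >> 24), (unsigned char)(u >> 16),
                           (unsigned char)(u >> 8),  (unsigned char)u };
    return crc32(crc, b, 4);
}

int main(void)
{
    node_t leaves[2] = {
        { (const unsigned char *)"3.14159", 7, NULL, 0 },
        { (const unsigned char *)"42",      2, NULL, 0 },
    };
    node_t top = { NULL, 0, leaves, 2 };
    printf("UID_3(T) = %08x\n", (unsigned)uid3(&top));
    return 0;
}
```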

Suppose a particular data set T has subordinate data sets nested m levels deep. Then, in some vague sense, I would think that

prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m

Given these probabilities are small in any case, this can be approximated as

 1 - (1-Q)(1-P)^m = Q + (1-Q)*m*P - (1-Q)*m*(m-1)/2*P^2 + ... ~ Q + m*P

So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, this simplifies further: my probability of error for UID_1 is P, and my probability of error for UID_3 is (m+1)*P, where m is my maximum nesting (recursion) level.

Does all this seem reasonable?

David I. McIntosh

3 Answers


I want a mechanism to allow me to avoid re-transmitting data sets.

rsync has already solved this problem, using generally the approach you outline.

However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?

You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make it really fast by computing the CRC on three buffers in the same loop, using all three arithmetic units on a single core in parallel, and then combining the results. See below for how to combine CRCs.
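For illustration, a minimal single-stream CRC-32C sketch using the SSE 4.2 intrinsics (compile with -msse4.2 on gcc/clang); the three-buffer interleaving described above additionally needs a CRC-32C combine step, which this sketch leaves out:

```c
/* A minimal single-stream CRC-32C sketch using the SSE 4.2 intrinsics;
 * compile with -msse4.2 on gcc/clang. The usual CRC-32C pre/post-inversion
 * is applied in main(). */
#include <nmmintrin.h>   /* _mm_crc32_u64, _mm_crc32_u8 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32c_hw(uint32_t crc, const unsigned char *buf, size_t len)
{
    uint64_t c = crc;
    while (len >= 8) {                 /* 8 bytes per crc32 instruction */
        uint64_t word;
        memcpy(&word, buf, 8);         /* avoid unaligned-access issues */
        c = _mm_crc32_u64(c, word);
        buf += 8;
        len -= 8;
    }
    while (len--)                      /* leftover tail bytes */
        c = _mm_crc32_u8((uint32_t)c, *buf++);
    return (uint32_t)c;
}

int main(void)
{
    const unsigned char data[] = "a ~1MB serialized data set would go here";
    uint32_t crc = crc32c_hw(0xFFFFFFFFu, data, sizeof(data) - 1) ^ 0xFFFFFFFFu;
    printf("CRC-32C = %08x\n", (unsigned)crc);
    return 0;
}
```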

However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction.

Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
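For illustration, a small sketch of crc32_combine() in use, cross-checked against a CRC run straight across the concatenation (link with -lz):

```c
/* A small sketch of crc32_combine(): given CRC(A), CRC(B) and len(B), it
 * yields the CRC of A&B without touching the data again. Link with -lz. */
#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const unsigned char A[] = "raw data of the top-level set";
    const unsigned char B[] = "a subordinate data set";

    uLong crcA = crc32(crc32(0L, Z_NULL, 0), A, sizeof(A) - 1);
    uLong crcB = crc32(crc32(0L, Z_NULL, 0), B, sizeof(B) - 1);

    /* CRC of the concatenation, from the two piece CRCs and len(B) only. */
    uLong combined = crc32_combine(crcA, crcB, (z_off_t)(sizeof(B) - 1));

    /* Cross-check: run the CRC straight across A followed by B. */
    uLong whole = crc32(crcA, B, sizeof(B) - 1);

    printf("combined = %08lx, whole = %08lx, %s\n",
           combined, whole, combined == whole ? "match" : "MISMATCH");
    return 0;
}
```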

Mark Adler
  • Excellent info! Didn't know about intel's crc32, nor that one could implement crc32_combine! crc32_combine makes my ramblings rather irrelevant. lol. The parallelization doesn't help in my case - I am either calculating the CRC in the process of serialization, which can't be parallelized, or I already have component CRC's, and then crc32_combine does the trick perfectly. Thanks Mark. – David I. McIntosh Mar 23 '13 at 20:39
  • Just out of curiosity though, since you are someone who would know, were my ramblings at least approximately correct? (even though quite useless given crc32_combine capabilities). – David I. McIntosh Mar 23 '13 at 20:50
  • I don't think you understand the parallelization I was describing. That is applied in a _single_ core on a single CRC calculation, as you are doing now on each of your component pieces. The parallelization makes that component CRC three times as fast. – Mark Adler Mar 23 '13 at 20:58
  • As for the ramblings, I mostly stopped reading after I saw that you were not aware of the ability to combine CRCs. A general comment is that if _T_ and _T'_ are "very" different as you state, then the details of the check calculation don't matter much, so long as the information in the message is well mixed into the check value. Which it is for any standard CRC. In the case of lots of errors, the probability of a false positive depends only on the number of bits in the check value. – Mark Adler Mar 23 '13 at 21:11
  • Quick question: I see the process to combine two CRC's involves (log_2 N) matrix multiplications. Do you have a rough feel for the minimum data length at which using the crc32_combine is more efficient than just running the CRC algorithm? – David I. McIntosh Mar 23 '13 at 21:22
  • Lastly, I think I did understand the parallelization you were describing, and if I were calculating CRC in isolation, it would help. Thing is, on a component piece, I cannot calculate the CRC until I have it serialized, but I can calculate the CRC _while_serializing it, with the cost of actually calculating the CRC being negligible compared to the cost of serializing. And the process of serializing it cannot be sped up with parallelization. (In other words, my CRC calculation is built-in to the serialization process.) – David I. McIntosh Mar 23 '13 at 21:24
  • As for the parallelization, you must have some buffer size you're using for the data in memory, on which you are calculating the CRC. You can just make that buffer size a small multiple of three, e.g. 768, and do three CRCs on the three 256-byte pieces. You can go even smaller and still get the benefit. – Mark Adler Mar 23 '13 at 23:40
  • Oh, I should mention that if the length of the second piece is fixed, i.e. you are always calling `crc32_combine()` with the same `len2`, then you can pre-calculate the operator to apply and then combination will take almost no time at all. – Mark Adler Mar 23 '13 at 23:52
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26821/discussion-between-david-i-mcintosh-and-mark-adler) – David I. McIntosh Mar 24 '13 at 17:28

Mark Adler's answer was bang on. If I'd taken my programmer's hat off and put on my mathematician's hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.

The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, so a polynomial of degree less than N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. We'll call our data-block polynomial A. For each CRC "version", a polynomial has been chosen for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial of degree 32, so it has 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written

A = Q * P + R

where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then

A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
          = Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
          = (Q*S + R*q + Q') * P    + R*r + R'

So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRCs of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Mark's other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making the combination of two CRCs quite trivial. (For an arbitrary length N, computing r is not trivial, but it is much faster (about log_2 N steps) than re-doing the division over the entire B.)

Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L = x**31 + x**30 + ... + x + 1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.

Also, I think I disagree with one of Mark's comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology with the time to compute it the straightforward way does not depend on the ratio of len(A) to len(B) - to compute it the "straightforward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of prepending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So it's a comparison of the time to compute B's CRC over again with the time to compute r (followed by dividing R*r+R' by P, which is trivial and likely takes inconsequential time).
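To illustrate the chaining just described, here is a self-contained sketch using a "pure" MSB-first CRC-32 (zero initial value, no reflection, no final XOR, i.e. the simplified model of the exposition above rather than any standard CRC variant): the CRC of A&B equals the CRC of B computed with the register seeded by the CRC of A.

```c
/* A self-contained sketch of the chaining described above, using a "pure"
 * MSB-first CRC-32 (zero initial value, no reflection, no final XOR):
 * CRC(A&B) equals the CRC of B computed with the register seeded by CRC(A). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define POLY 0x04C11DB7u  /* CRC-32 polynomial P, with the x**32 term implicit */

/* CRC of 'len' bytes of 'data', starting from register value 'crc'. */
static uint32_t crc32_pure(uint32_t crc, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;              /* feed next byte, MSB first */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY : (crc << 1);
    }
    return crc;
}

int main(void)
{
    const unsigned char A[] = "some raw data";
    const unsigned char B[] = "a subordinate data set";

    /* CRC of the concatenation A&B, computed directly. */
    unsigned char AB[sizeof(A) - 1 + sizeof(B) - 1];
    memcpy(AB, A, sizeof(A) - 1);
    memcpy(AB + sizeof(A) - 1, B, sizeof(B) - 1);
    uint32_t direct = crc32_pure(0, AB, sizeof(AB));

    /* Same value, obtained by seeding B's CRC with A's CRC ("prepending" R to B). */
    uint32_t chained = crc32_pure(crc32_pure(0, A, sizeof(A) - 1), B, sizeof(B) - 1);

    printf("direct = %08x, chained = %08x, %s\n", (unsigned)direct, (unsigned)chained,
           direct == chained ? "match" : "MISMATCH");
    return 0;
}
```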

David I. McIntosh
  • You are correct -- my comment assumed calculating the CRC of the whole thing again, but you don't need to if you have the CRC of the first piece. I deleted that comment. – Mark Adler Apr 07 '13 at 23:56

Mark Adler's answer addresses the technical question, so that's not what I'll do here. Instead, I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.

Checksums and hashes provide a single signature value for some data. However, being of finite length, a checksum/hash has fewer possible unique values than the raw data does whenever the data is longer than the checksum. For instance, a 4-byte CRC can only ever take on 4,294,967,296 unique values, whilst even 5 bytes of data can take on 256 times as many values. This means that for any data longer than the checksum itself, there always exist one or more byte combinations with exactly the same signature.

When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small, so we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and, given a checksum c calculated using a checksum function f, verify that f(d) == c.

The OP's algorithm, however, uses checksums in a different way, and this introduces a subtle, detrimental degradation of confidence. Server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list to determine whether any are missing. Say server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from server A, and in the ideal case the synchronization works properly.

If we recall the possibility of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A, but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct, and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision-resistant checksums and hashes, but this scheme can never guarantee complete synchronization. This becomes more important the greater the number of files the network must keep track of. I would recommend using a collision-resistant cryptographic hash such as SHA-256.
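For illustration, the collision cited above can be checked directly, assuming the standard zlib CRC-32 is the variant behind the well-known "plumless"/"buckeroo" example (link with -lz):

```c
/* A quick check of the collision cited above, assuming the standard CRC-32
 * as implemented by zlib is the variant behind the well-known example.
 * Link with -lz. */
#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const unsigned char a[] = "plumless";
    const unsigned char b[] = "buckeroo";
    uLong ca = crc32(crc32(0L, Z_NULL, 0), a, sizeof(a) - 1);
    uLong cb = crc32(crc32(0L, Z_NULL, 0), b, sizeof(b) - 1);
    printf("crc32(\"plumless\") = %08lx\ncrc32(\"buckeroo\") = %08lx\n%s\n",
           ca, cb, ca == cb ? "collision" : "no collision");
    return 0;
}
```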

A possible mitigation of this risk is to introduce redundant hashes. One way of doing this is to use a completely different algorithm: whilst it is possible that crc32(d1) == crc32(d2), it is less likely that adler32(d1) == adler32(d2) at the same time. This paper suggests you don't gain all that much this way, though. To use the OP's notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true, so you can "salt" your way to less collision-prone combinations. However, you may just as well use a collision-resistant hash function like SHA512, which in practice likely won't have that great an impact on your performance.
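For illustration, a sketch of the salting idea: a 64-bit signature built from two CRC-32s computed under two different one-byte prefixes, following the 'a'/'b' notation above (illustration only, not a substitute for a real collision-resistant hash; link with -lz):

```c
/* A sketch of the salting idea: a 64-bit signature built from two CRC-32s of
 * the same data under two different one-byte prefixes ('a' and 'b', following
 * the notation above). Illustration only. Link with -lz. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

static uint64_t salted_signature(const unsigned char *d, size_t len)
{
    uLong ca = crc32(crc32(crc32(0L, Z_NULL, 0), (const Bytef *)"a", 1), d, (uInt)len);
    uLong cb = crc32(crc32(crc32(0L, Z_NULL, 0), (const Bytef *)"b", 1), d, (uInt)len);
    return ((uint64_t)(uint32_t)ca << 32) | (uint32_t)cb;
}

int main(void)
{
    const unsigned char d[] = "some data set";
    printf("signature = %016llx\n",
           (unsigned long long)salted_signature(d, sizeof(d) - 1));
    return 0;
}
```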

jeteon
  • You are correct in everything you say. But, two things about this particular problem make your comments not really applicable. (1) Using your notation, certain peculiarities about the data sets dnA and dnB (for fixed n) make it highly improbable that f(dnA) = f(dnB) unless dnA = dnB. In detail, dnA and dnB will differ in only limited ways, such that their CRC's are virtually guaranteed to be different. This is a peculiarity of our problem - we are not creating general tools for some product. And (2) in my original post, I did not sufficiently emphasize the importance of performance. – David I. McIntosh Aug 22 '15 at 01:39
  • In particular, note that with the work above and some other simplifications, given CRC(A) and CRC(B), one can compute CRC(A&B) virtually instantaneously, especially on modern Intel processors with the CRC instruction (which also allows computing CRC(X) very quickly). However, your comments will serve as warnings to anyone who tries to use this approach for general synchronization. – David I. McIntosh Aug 22 '15 at 01:46
  • I guess I didn't say so, but the concern was more that it's possible `f(dnA) = f(dmB)` (possibly different indices). However, since you say that your problem is structured such that by construction this can't happen, I agree that this answer is more general than what you sought. It's been interesting to find that for the benchmarks of general implementations I've been running into, the performance difference between MD5 and CRC32 can be a factor smaller than 2. – jeteon Aug 22 '15 at 02:37
  • How is the CRC32 implemented? In particular, is it using the Intel machine instruction, and is it doing the parallel-and-combine scheme suggested by Mark? I ask out of interest, not to challenge, because, again, in our circumstance we can compute the CRC32 almost for free while serializing. That also might have been true for MD5 - I don't know enough about it. Also, if we were comparing f(dnA) with f(dmB), there would indeed be very serious degradation issues, orders of magnitude more possibility for collisions. – David I. McIntosh Aug 23 '15 at 03:57
  • The CRC32 in the benchmarks was generally implemented in C; they didn't use the machine instruction. The CRC32 instruction seems to be more than 5x as fast according to http://www.strchr.com/crc32_popcnt. The MD5 hash represents a "state" reached at the end of hashing. As far as I know you would just hash the files serially to get the combined hash (starting with the last hash value each time). However, I don't think there is a way to get the total MD5 sum given the component sums. You can just hash the concatenated hashes, though. – jeteon Aug 23 '15 at 21:10