2

I have a tree structure where each node knows its CRC. What's a reasonable way to compute a CRC for each sub-tree that would give me a good value for the entire sub-tree to that point? In other words, a value to identify if any part of the sub-tree was changed.

My current thought is simply take each child node CRC, convert it to a string/byte[], concatenate all the nodes together, and take the CRC of that byte[]. But I'm not sure if this might lead to easy collisions as I suspect this removes quite a bit of information.

(I looked at crc32_combine but it seems inappropriate because I don't have any lengths. I could use a length of zero, but would that be any better or worse?)

Working in C# but I guess this is really language agnostic.

EDIT: Ended up going with this technique. Will switch to longer hashes if collisions seem to be a problem. While I don't need leaf order to be important, am not using xor just in case it is later on.

  • Well, as long as you realize that a different CRC means there was a change (of course) but that no difference does _not_ necessarily mean that there was no change, I think you can probably get away with xor'ing the values - possibly with a bit rotation thrown in for each value. – 500 - Internal Server Error Nov 07 '14 at 20:59
  • 2
    The concern that I would have with XORing the hashes is that it doesn't take position into account. So if a node moves within a subtree, that subtree would still have the same hash. – Isaac Hildebrandt Nov 07 '14 at 21:07
  • 1
    @KingIsaac: Correct - hence my speculation along the lines of rotating each value before the xor - that could be based on its horizontal offset. – 500 - Internal Server Error Nov 07 '14 at 21:16
  • "I looked at crc32_combine but it seems inappropriate because I don't have any lengths." You must have lengths. In order to compute the CRCs that you have, some number of bytes were fed to the CRC algorithm. How many bytes were fed to the CRC for each node? – Mark Adler Nov 07 '14 at 21:33
  • @MarkAdler: I can assume CRCs are available, but I can't guarantee I'd know a length. A particular node may be completely outside my control. Although often not. –  Nov 07 '14 at 21:55

3 Answers3

0

I'd probably use least SHA1 for your checksums since the collisions aren't that infrequent for MD5s and your idea about combining the CRCs seems solid though personally I'd XOR the hashes together to save on RAM and CPU cycles.

Srdjan Grubor
  • 2,605
  • 15
  • 17
  • Are you (and @Bob) suggesting SHA-1/SHA-2 just because of the larger hash (valid point), or is there something specific about combining them that adds value too? –  Nov 07 '14 at 22:00
  • @WillRubin It's only for reduction of collisions but it doesn't really do much else than that. – Srdjan Grubor Nov 07 '14 at 22:36
0

You should use something designed for this such as SHA-2. You may be able to get away with CRC32 depending on your particular requirements. There is a similar question posted here with more discussion:

Can CRC32 be used as a hash function?

Community
  • 1
  • 1
0

Ideally you would combine the CRCs of the nodes to compute the CRC of a sub-tree, using something like crc32_combine(). The result would be the same as computing the CRC over all the nodes in whatever canonical ordering you have defined. This would only check the ordering though, not the structure of the tree. A different structure with the same ordering would give the same CRC. This will be true no matter how you combine the CRCs, unless you include additional information on the tree structure.

The use of crc32_combine() requires the length of the data for each of the CRCs being combined (except the first). This is apparently not saved and not available in this case. You can instead make a stream of bytes of the CRCs in the canonical order and take the CRC of that stream. (You will need to decide if the CRCs are to be stored big or little endian in the byte stream, and then stick to your convention.)

The use of cryptographic signatures such as SHA1 or MD5 is unnecessary, unless you are worried for some reason that a devious human is interfering with your computed check values and trying to trick you into thinking that contents of the tree have not changed when they have. (The devious human can already do this at the level of the nodes anyway, since CRCs are easily spoofed.) Otherwise such signatures are just a waste of CPU time. If you simply want a longer hash, more than 32 bits, to reduce the probability of collisions, then you can use a fast hash function such as one from the CityHash family.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158