I have a system with roughly 100 million documents, and I'd like to keep track of their modifications between mirrors. To exchange information about modifications efficiently, I want to send it per day rather than per document. Something like this:

[ 2012/03/26, cs26],
[ 2012/03/25, cs25],
[ 2012/03/24, cs24],
...

where each cs is the checksum of timestamps of all documents created on a particular day.

Now, the problem I'm running into is that I don't know of an algorithm that could "subtract" data from the checksum when a document is being deleted. None of the cryptographic hashes fit the need, for obvious reasons, and I couldn't find any algorithms for CRC that would do this.

One option I considered was to have deletes add extra information to the hash, but this would lead to even more problems: nodes can receive delete requests in a different order, and when a node restarts it re-reads all the timestamps from the documents, so the information about the deletes would be lost.

I also wouldn't like using a hash tree with all document hashes in-memory, as that would use roughly 8 gigs of memory, and I think it's a bit of overkill for just this need.

For now the best option seems to be regenerating these hashes completely from time to time in the background, but that is also a lot of needless overhead, and it wouldn't provide immediate information on changes.

So, do you guys know of a checksum algorithm that would let me "remove" some data from the checksum? I need the algorithm to be reasonably fast, and the checksum to reliably reflect even the smallest change (that's why I can't really use plain XOR).

Or maybe you have better ideas about the whole design?

Cœur
  • I don't get it. Why can't you XOR all of the checksums? If one document gets deleted, you XOR in that document's checksum again, and you should have a checksum for the rest of the files. – aioobe Mar 26 '12 at 14:08
  • How many modifications do you have per day? Couldn't you just do a checksum for the modifications? – biziclop Mar 26 '12 at 14:08
  • @aioobe I don't really keep separate checksums for particular documents, so it just didn't cross my mind but yes, that's a great idea, essentially Jason S suggested the same thing – Andrejs Krasilnikovs Mar 26 '12 at 14:16
  • It is not clear what you want to do with these checksums. Suppose a node receives `[ 2012/03/26, cs26]`... what now? – n. m. could be an AI Mar 26 '12 at 14:17
  • @biziclop modifications can arrive in different sequence to each node, so in that case the nodes might actually be in sync, but they will think otherwise. – Andrejs Krasilnikovs Mar 26 '12 at 14:18
  • @n.m. it will then compare the checksum to its own, request the list of documents and their timestamps on that date if there is a mismatch in checksums, and then the contents of the documents where the timestamp doesn't match – Andrejs Krasilnikovs Mar 26 '12 at 14:19
  • So you can send separately a checksum of timestamps of deleted documents (grouped by both creation and deletion date). Re-sync each mismatched creation date (either different checksums for that date, or it is present in one set and absent in the other). – n. m. could be an AI Mar 26 '12 at 14:45
  • @n.m. Interesting idea, the only problem with it is that equal nodes would have to go to the next step if one got a create/delete request for a document and the other never saw either of them - the repository would actually be identical, yet the hashes will say otherwise. But I think it's almost irrelevant, as the situation is rare and the overhead in such a case is small. – Andrejs Krasilnikovs Mar 26 '12 at 14:53
  • @n.m That's pretty much what I had in mind too: create a separate checksum for new documents created, one for modified documents and one for the deleted. The overhead is minimal, as the only difference to a single checksum is that with every operation you have to decide which of the three checksums to update. – biziclop Mar 26 '12 at 15:04

1 Answer

How about

hash = X(documents, 0, function(document) { ... })

where X is an aggregate XOR (javascript-y pseudocode follows):

// Fold the per-document hashes f(document) into the running value x with XOR.
function X(documents, x, f)
{
   for (const document of documents)
   {
      x ^= f(document);
   }
   return x;
}

and f() is a hash of individual document information? (whether timestamp or filename or ID or whatever)

The use of XOR would allow you to "subtract" out documents, but using a hash on a per-document basis allows you to preserve a hash-like quality of detecting small changes.
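To make the "subtract" step concrete, here is a minimal Node.js sketch of a per-day running checksum. It is only an illustration under assumptions: `docHash`, `DayChecksum`, and the `id`/`timestamp` fields are hypothetical names, and taking the first 64 bits of SHA-256 over `id:timestamp` is just one possible choice for f(). The id is hashed together with the timestamp so that two documents sharing a timestamp do not cancel each other out under XOR.

```javascript
const crypto = require("crypto");

// Hypothetical f(): first 64 bits of SHA-256 over "id:timestamp".
// Hashing the id as well is an assumption, made so that two documents
// with the same timestamp do not XOR to zero.
function docHash(doc) {
  const digest = crypto
    .createHash("sha256")
    .update(`${doc.id}:${doc.timestamp}`)
    .digest();
  return digest.readBigUInt64BE(0);
}

// Running XOR checksum for the documents of one day (hypothetical helper).
class DayChecksum {
  constructor() {
    this.value = 0n;
  }
  // Adding a document XORs its hash into the running value.
  add(doc) {
    this.value ^= docHash(doc);
  }
  // Removing is the same operation, because XOR is its own inverse.
  remove(doc) {
    this.value ^= docHash(doc);
  }
}

const a = { id: "doc-a", timestamp: 1332770880 };
const b = { id: "doc-b", timestamp: 1332770881 };
const c = { id: "doc-c", timestamp: 1332770882 };

// Node 1 sees b being created and later deleted.
const node1 = new DayChecksum();
node1.add(a);
node1.add(b);
node1.add(c);
node1.remove(b);

// Node 2 never sees b and receives the other documents in a different order.
const node2 = new DayChecksum();
node2.add(c);
node2.add(a);

console.log(node1.value === node2.value); // prints true
```

Because XOR is commutative, the order in which nodes receive creates and deletes does not matter, and a node that restarts can rebuild the same value by re-reading the surviving documents. A modification can be handled as a `remove` with the old timestamp followed by an `add` with the new one.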

Jason S