
There is a lot of information about binary diff algorithms for pretty big files (1MB+). However, my use case is different. This is why it is not a duplicate.

I have a collection of many objects, each in the 5-100 byte range. I want to send updates on those objects over the network. I want to compile the updates into individual TCP packets (with a reasonable MTU of ~1400) and fit as many updates into each packet as possible: first add their IDs, and then put the combined binary diffs of all those objects.
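Roughly the kind of packing I have in mind (a sketch only; the field sizes, 16-bit count, 32-bit IDs, 16-bit diff lengths, are made up for illustration, not a finished wire format):

```python
import struct

MTU = 1400  # rough payload budget per packet

def build_packet(updates):
    """updates: iterable of (object_id, diff_bytes). Packs as many as fit."""
    id_entries, diff_chunks = [], []
    used = 2  # 16-bit update count header
    for obj_id, diff in updates:
        entry_size = 4 + 2 + len(diff)  # 32-bit id + 16-bit diff length + diff body
        if used + entry_size > MTU:
            break
        id_entries.append(struct.pack("!IH", obj_id, len(diff)))
        diff_chunks.append(diff)
        used += entry_size
    header = struct.pack("!H", len(id_entries))
    return header + b"".join(id_entries) + b"".join(diff_chunks)
```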

What is the best binary diff algorithm to be used for such a purpose?

Max Yankov
  • Possible duplicate of [Binary diff tool for very large files?](https://stackoverflow.com/questions/688504/binary-diff-tool-for-very-large-files) – Evan Carroll Nov 09 '18 at 04:23
  • I did read your question. Why do you think the size of your objects matters? Generally, if it works for a big file, it'll work for a small file; you usually don't have a problem scaling down. Lossless compression is lossless compression. Perhaps you did want something more algorithmic, but it doesn't make sense to define it like that. If you want to compress for the network packet, use network compression. If you want the smallest transfer size, use maximum compression in the app and let the network do its thing as usual (which is how most things work). – Evan Carroll Nov 09 '18 at 18:46
  • To make matters even worse, you accepted an answer that just points to "longest-common-subsequence". That's not an algorithm at all; it's an abstract problem. The algorithm behind diff is [mentioned in this question](https://stackoverflow.com/questions/42635889/myers-diff-algorithm-vs-hunt-mcilroy-algorithm). – Evan Carroll Nov 09 '18 at 18:50
  • *Small* Binary **Large** OBjects? – greybeard Nov 10 '18 at 08:26

2 Answers


The simple answer is to combine your many small objects into a single large one, and then use the existing binary diff algorithms to efficiently send the diff on that combined object.
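A minimal sketch of that combining step, assuming you keep an index of each object's offset so both sides can locate individual objects inside the combined buffer (the function and variable names here are illustrative, not a specific library API):

```python
def combine(objects):
    """objects: dict mapping object id -> bytes.
    Returns (index, combined) where index maps id -> (offset, length)."""
    index, chunks, offset = {}, [], 0
    for obj_id in sorted(objects):        # a stable order both sides agree on
        data = objects[obj_id]
        index[obj_id] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return index, b"".join(chunks)

# The combined buffer can then be fed to an existing binary diff tool
# (e.g. bsdiff or xdelta) as if it were one large file.
```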

But there is no need to roll your own. I would personally solve this problem by putting all of the objects into a filesystem, and then sending the diff using rsync.

This is theoretically not optimal. However, consider the following:

  1. The implementation is simple.
  2. The code is battle-tested; it will take a long time for your code to reach a similar level of reliability.
  3. This implementation covers edge cases where the receiving side's state is not what the sender expects, as could happen if, say, the receiver had a crash/restart.

There are cases where this is the wrong solution. But I would insist on having this straightforward solution fail before being clever and sophisticated about it.

btilly
  • Thanks for your answer. Unfortunately, I can't rely on another application (such as rsync) being installed on a target system, and I suspect that writing to and reading from disk for every update (which I need to perform in about 50 ms) wouldn't work well from a performance perspective. – Max Yankov Feb 22 '18 at 08:43

With such small objects, you could use the classic longest-common-subsequence algorithm to create a 'diff'.

That is not the best way to approach your problem, however. The LCS algorithm is constrained by the requirement to use each original byte only once when matching to the target sequence. You probably don't really need that constraint to encode your compressed packets, so it will produce a sub-optimal solution in addition to being kinda slow.

Your goal is really to use the example of the original objects to compress the new objects, and you should think about it in those terms.

There are many, many ways, but you probably have some idea about how you want to encode those new objects. Probably you are thinking of replacing sections of the new objects with references to sections of the original objects.

In that case, it would be practical to make a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) for each original object. Then, when you're encoding the corresponding new object, at each byte you can use the suffix array to find the longest matching segment of the old object. If it's long enough to result in a savings, then you can replace the corresponding bytes with a reference.
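A rough sketch of that encoding, assuming an invented op list of literal bytes and (offset, length) copies; the naive suffix-array construction and the `MIN_MATCH` threshold are illustrative choices, not part of the answer:

```python
from bisect import bisect_left

def build_suffix_array(data: bytes) -> list:
    """Naive O(n^2 log n) construction; fine for 5-100 byte objects."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def _common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def longest_match(old: bytes, sa: list, query: bytes):
    """Longest prefix of `query` occurring anywhere in `old`, as (offset, length)."""
    suffixes = [old[i:] for i in sa]            # small objects, so this is cheap
    pos = bisect_left(suffixes, query)
    best_off, best_len = 0, 0
    for cand in (pos - 1, pos):                 # best match is adjacent in sorted order
        if 0 <= cand < len(sa):
            length = _common_prefix_len(suffixes[cand], query)
            if length > best_len:
                best_off, best_len = sa[cand], length
    return best_off, best_len

MIN_MATCH = 4  # only emit a reference when it actually saves space

def encode(old: bytes, new: bytes) -> list:
    """Encode `new` as literal bytes plus (offset, length) copies from `old`."""
    sa = build_suffix_array(old)
    ops, i = [], 0
    while i < len(new):
        off, length = longest_match(old, sa, new[i:])
        if length >= MIN_MATCH:
            ops.append(("copy", off, length))
            i += length
        else:
            ops.append(("literal", new[i:i + 1]))
            i += 1
    return ops
```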

Matt Timmermans
  • Thanks for your answer. After experimenting with more complex compression techniques, I ended up implementing the simplest one: I just write the length of the common prefix and suffix for the old and new versions of the object. It turns out that, with the nature of my data (most updates modify a few consecutive bytes in the middle of the byte array), it gave me a compression ratio of almost 2. – Max Yankov Feb 22 '18 at 08:41
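A minimal sketch of the prefix/suffix scheme described in that comment, using an invented `(prefix_len, suffix_len, middle_bytes)` encoding for illustration:

```python
def encode_update(old: bytes, new: bytes):
    """Describe `new` by how much of `old` it keeps at each end."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]

def apply_update(old: bytes, prefix: int, suffix: int, middle: bytes) -> bytes:
    tail = old[len(old) - suffix:] if suffix else b""
    return old[:prefix] + middle + tail

# Example: encode_update(b"hello world", b"hello brave world") -> (6, 5, b"brave ")
```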