
I have a particular task at hand where I have two binary data buffers of exactly the same size (say a few megabytes) which I am compressing with Boost's boost::iostreams::gzip_compressor. The buffers are different (but similar) in content and I want to estimate which of the two will compress better. Of course I can actually do the compression and then know for sure, but ideally I would like to avoid this. I tried using the classic "entropy" calculation formula for data (the one used in books as an introduction to data compression) to get such an estimate, but my results were not consistent with the actual gzip results. Here's my C++ code for the estimate:

#include <cmath>   // log2
#include <cstring> // memset

long getEntropySize (const unsigned char *ptr, size_t size)
{
    long * table = new long [256];

    memset (table, 0, 256 * sizeof(long));

    for (size_t i=0; i < size; i++)
    {
        table[ptr[i]]++;
    }

    // accumulate in double: adding into a long would truncate the
    // fractional bits contributed by every bucket
    double total_bits = 0;

    for (size_t i=0; i < 256; i++)
    {
        if (table[i] == 0) continue;

        double p = (double)size / table[i]; // reciprocal of the symbol probability

        double bits = log2 (p);

        total_bits += bits * table[i];
    }

    delete[] table;

    return (long)((total_bits + 7) / 8); // bits -> bytes, rounded up
}

So, can I do better than this function?

antonio
  • Sure, you can compress, or fake compress, or even fake lossy compress the data and count how many bits it would take in any kind of way. Counting how many of each bit pattern there are is a form of ridiculously lossy compression, for example. I would consider looking at the specific compression algo you are going to use and trying a weaker+faster implementation. Maybe even only run it on a random subset of the original data. Entropy is "really" how small something can be compressed. There is no real shortcut. – Yakk - Adam Nevraumont Mar 17 '21 at 05:03
  • You're just going to end up implementing significant parts of `gzip`. Do what you suggested yourself: GZip the data and see what actually happens. – user207421 Mar 17 '21 at 05:11
  • @user207421 I will actually have to do this not just with 2 but with 4 or 5 different sets in my case, so it will impact my performance. The best I can hope for is some library call in gzip to get such an estimate without doing the compression. I don't need to know absolute numbers, just whether set #1 will compress better than set #2. – antonio Mar 17 '21 at 05:43
  • The function above works correctly in about 75% of cases. But in the other 25% of cases gzip will compress one set better than the other while the function tells me otherwise. – antonio Mar 17 '21 at 05:47
  • The entropy calculation you are doing would be a good guess only if gzip's compression were mostly due to taking advantage of the subset of byte values used, so that there is no significant correlation in the ordering of the symbols. Unfortunately, on most data gzip gets most of its compression from matching strings that appeared previously in the data, not from the symbol frequencies. Your entropy calculation will completely miss that. – Mark Adler Mar 17 '21 at 18:35
  • In order to estimate the impact of matching strings, you would need to search for said matching strings. That is exactly what gzip spends most of its CPU time on, so you won't get any advantage over simply running gzip on the data. – Mark Adler Mar 17 '21 at 18:35
  • If your few-megabyte files are relatively homogeneous, i.e. any given portion has similar structure to any other portion, then I would recommend running gzip on a small portion of the data, say 64K. That should be enough to see how much string matching is benefiting compression, since the gzip window size is 32K. – Mark Adler Mar 17 '21 at 18:37
  • @MarkAdler My data is not always homogeneous, but it's a good suggestion. Typically it's a few megs, so I guess I can take a few random snippets of the data (say 16K each) and try zipping them to roughly see how they compare. – antonio Mar 18 '21 at 00:24

1 Answer


The function above works correctly in about 75% of cases. But in the other 25% of cases gzip will compress one set better than the other while the function tells me otherwise.

I'd say that if your test is significantly cheaper, that is a good success rate. Of course, it depends on your input data: if it is always ~uniformly random, you will not get a good classifier, as the compression margins will by definition be too small to guess reliably.

Profile your code

I see your code and immediately see the dynamic allocation. Depending on your compiler, standard version, and optimization flags, that might be optimized away (see Optimization of raw new[]/delete[] vs std::vector, but also note that it doesn't actually seem to happen on the latest GCC/Clang/MSVC in C++20 mode: Compiler Explorer For All Three).

Therefore you should probably express your intent as well as profile the code to identify bottlenecks.
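Before reaching for a real profiler (perf, VTune, ...), a crude chrono-based micro-benchmark is often enough to tell whether the entropy estimate is actually cheaper than compressing. A minimal sketch (the `bestRunMs` name and the workload are mine, not from the original code):

```cpp
#include <algorithm>
#include <chrono>

// Run the workload a few times and report the best wall-clock time in
// milliseconds; taking the minimum filters out scheduling noise.
template <typename F>
double bestRunMs(F&& f, int runs = 5) {
    using clock = std::chrono::steady_clock;
    double best = 1e300;
    for (int i = 0; i < runs; ++i) {
        auto t0 = clock::now();
        f();
        auto t1 = clock::now();
        best = std::min(
            best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;
}
```

Usage would be along the lines of `bestRunMs([&] { getEntropySize(buf); })` versus the same wrapper around the gzip path.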

Here are a few changes I'd suggest to make things more generic, expressive, and potentially more efficient (I didn't profile them):

#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <span>

template <typename Ch = std::uint8_t, typename T,
          size_t Extent = std::dynamic_extent>
static inline constexpr auto getEntropySize(std::span<Ch const, Extent> buf,
                                            std::span<T, 256> table) {
    static_assert(sizeof(Ch) == 1);
    std::fill_n(table.data(), table.size(), 0);

    for (std::uint8_t v : buf) { // implicit conversion to unsigned chars
        ++table[v];
    }

    T total_bits = 0;

    for (auto bucket: table) {
        if (bucket) {
            double p    = static_cast<double>(buf.size()) / bucket;
            double bits = std::log2(p);

            total_bits += (bits * bucket);
        }
    }

    return (total_bits + 7) / 8;
}

template <typename Buf>
static inline constexpr auto getEntropySize(Buf const& buf) {
    std::array<long, 256> table;
    return getEntropySize(std::span(std::data(buf), std::size(buf)),
                        std::span(table));
}

#include <cstdio>
#include <string>
#include <string_view>
#include <vector>
int main() {
    using namespace std::literals;
    std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx")); // includes trailing NUL
    std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx"s));
    std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx"sv));

    std::printf("%ld\n", getEntropySize("Hello world!"));
    std::printf("%ld\n", getEntropySize("Hello world!"sv));
    std::printf("%ld\n", getEntropySize("Hello world!"s));

    std::printf("%ld\n", getEntropySize(std::vector<unsigned char>{
        0xec, 0x48, 0xf0, 0x77, 0xf7, 0xd1, 0xd0, 0x08, 0xa8, 0x4b,
        0x1d, 0x61, 0x24, 0x77, 0xe8, 0x16, 0xe1, 0x09, 0x9a, 0x65,
        0x16, 0x94, 0xe7, 0xd3, 0x4b, 0xa4, 0xa7, 0x1a, 0xe7, 0x29,
        0x15, 0x59, 0x79, 0x4e, 0x19, 0x79, 0x17, 0xfd, 0x0a, 0x34}));
}

Prints (Live On GCC and Clang):

1
0
0
5
4
4
24

BONUS

You might use a bit-twiddling hack to compute log2 more cheaply: Fast computing of log2 for 64-bit integers

See e.g. std::countl_zero, __builtin_clz, etc.

sehe
  • Thanks for the feedback! Yes, you're right: the 75% rate is not bad (that's just my observation on the very limited data sets I have tried so far). The entropy function is way, way faster than doing the actual compression; plus, when there's a discrepancy in the remaining 25% of cases, the differences between the entropy estimates and the zipped sizes are not huge, so it looks like I am picking up the important cases at least. Good point about profiling. I guess for 256 4-byte longs I could just use the stack instead of allocating with new. This code is in development so it may need some work. – antonio Mar 17 '21 at 18:05