The function above works in about 75% of cases correctly. But in 25% of cases gzip will compress better one set over the other while the function tells me otherwise
I'd say if your test is significantly cheaper that is a good success rate. Of course it depends on your input data. If your input data is always ~uniformly random, you will not get a good classifier as the compression margins will by definition be too small to be reliably guessed.
Profile your code
I see your code and immediately see the dynamic allocation. Depending on your compiler, standard version and optimization flags that might be optimized away (see Optimization of raw new[]/delete[] vs std::vector, but also note that it doesn't actually seem to happen on latest GCC/Clang/MSVC in C++20 mode: Compiler Explorer For All Three).
Therefore you should probably express your intent as well as profile the code to identify bottlenecks.
Here's a few changes I'd suggest to make things more generic, expressive and potentially more efficient (didn't profile that):
#include <array>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <span>
template <typename Ch = std::uint8_t, typename T,
size_t Extent = std::dynamic_extent>
static inline constexpr auto getEntropySize(std::span<Ch const, Extent> buf, std::span<T, 256> table) {
static_assert(sizeof(Ch) == 1);
std::fill_n(table.data(), table.size(), 0);
for (std::uint8_t v : buf) { // implicit conversion to unsigned chars
++table[v];
}
T total_bits = 0;
for (auto bucket: table) {
if (bucket) {
double p = static_cast<double>(buf.size()) / bucket;
double bits = std::log2(p);
total_bits += (bits * bucket);
}
}
return (total_bits + 7) / 8;
}
template <typename Buf>
static inline constexpr auto getEntropySize(Buf const& buf) {
std::array<long, 256> table;
return getEntropySize(std::span(std::data(buf), std::size(buf)),
std::span(table));
}
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>
int main() {
using namespace std::literals;
std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx")); // includes trailing NUL
std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx"s));
std::printf("%ld\n", getEntropySize("xxxxxxxxxxxx"sv));
std::printf("%ld\n", getEntropySize("Hello world!"));
std::printf("%ld\n", getEntropySize("Hello world!"sv));
std::printf("%ld\n", getEntropySize("Hello world!"s));
std::printf("%ld\n", getEntropySize(std::vector<unsigned char>{
0xec, 0x48, 0xf0, 0x77, 0xf7, 0xd1, 0xd0, 0x08, 0xa8, 0x4b,
0x1d, 0x61, 0x24, 0x77, 0xe8, 0x16, 0xe1, 0x09, 0x9a, 0x65,
0x16, 0x94, 0xe7, 0xd3, 0x4b, 0xa4, 0xa7, 0x1a, 0xe7, 0x29,
0x15, 0x59, 0x79, 0x4e, 0x19, 0x79, 0x17, 0xfd, 0x0a, 0x34}));
}
Prints Live On GCC and Clang¹
1
0
0
5
4
4
24
BONUS
You might use a bit twiddling hack to get the log2
implementation for cheaper: Fast computing of log2 for 64-bit integers
See e.g. std::countl_zero
, __builtin_clz_
etc.