
This is a bit of a weird situation to be in. I know that the real solution is to dive into open-source and fix the bug. But please humor me…

I am using the GELF standard to send log messages into logstash (version 5.0.0). Sadly, logstash's GELF parser (Ruby gem gelfd:0.2.0) can only parse compressed messages.

The path of least resistance for me is to just compress every log message, even if it's a 100-byte message. There would be no meaningful size benefit (it fits in a single UDP datagram either way, and its destination is localhost); in fact, the compressed message may end up larger than the original.

My worry is that my applications will be performing a lot of unnecessary compression (and my logstash server a lot of unnecessary decompression) just to work around this bug in gelfd.

The compression algorithms supported by GELF are GZIP and ZLIB.

Using these algorithms: how significant is the computational cost of compressing and then decompressing a tiny message?


Edit: Sorry for submitting this question without having done any research of my own. As penance: I have now submitted my own benchmark results as an answer.

Birchlabs

2 Answers


I have written a benchmarking script for GZIP.

It is not totally representative:

  • does not use exactly the same compressor/decompressor implementation that the Java program would be using
  • does not operate under the exact same runtime conditions as the Java program
  • does not test a variety of strings
  • does not test a variety of string sizes

Nevertheless, it gives a useful heuristic.

plain="2016-11-09 20:56:02,469 ERROR [main] c.s.HelloExample - Something wrong with customer 'CUS-123e4567-e89b-12d3-a456-42665544'"

echo "Log message which we are using for this test is:"
echo "$plain"

echo "Time taken to echo this string 10,000 times:"

time for a in $(seq 1 10000);
do
    echo "$plain" > /dev/null
done

echo "Time taken to echo and compress this string 10,000 times:"

time for a in $(seq 1 10000);
do
    echo "$plain" | gzip -cf > /dev/null
done

echo "Time taken to echo, compress and decompress this string 10,000 times:"

time for a in $(seq 1 10000);
do
    echo "$plain" | gzip -cf | gzip -cfd > /dev/null
done

Here's how the measurements came out:

Log message which we are using for this test is:
2016-11-09 20:56:02,469 ERROR [main] c.s.HelloExample - Something wrong with customer 'CUS-123e4567-e89b-12d3-a456-42665544'
Time taken to echo this string 10,000 times:

real    0m1.940s
user    0m0.591s
sys 0m1.333s
user+sys 0m1.924s
Time taken to echo and compress this string 10,000 times:

real    0m22.028s
user    0m11.309s
sys 0m17.325s
user+sys 0m28.634s
Time taken to echo, compress and decompress this string 10,000 times:

real    0m22.983s
user    0m18.761s
sys 0m27.322s
user+sys 0m46.083s
[Finished in 47.0s real time]

User+sys shows how much CPU time was used; that's the figure that matters for judging how computationally intensive this is.

So, compression takes about 14.9x more computation than just echoing the string raw.

Compression + decompression takes 24.0x more computation than just echoing the string raw. This is only 1.6x more computation than compressing.

Conclusions:

  • It's not cheap to compress even a tiny file in GZIP.
  • GZIP decompression is cheap!

Caution: this test may, in reality, have been measuring the startup and cleanup costs of the gzip executable. I am not sure whether those are significant, but we can at least see that it uses more than one thread (user + sys > real). So I could imagine setup overhead such as starting pthreads.

I was not able to find any conclusive answer for what the time complexity of GZIP is with respect to the size of the input. But it would be interesting to know.
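To sidestep the process-startup overhead entirely, and to get closer to what the Java application would actually be doing, an in-process benchmark using Java's standard java.util.zip classes could look something like the sketch below. The class name and the 10,000-iteration count are my own choices for illustration, and the message is just the sample log line from above; treat any numbers it prints as a rough heuristic, not as the measurements reported in this answer.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical micro-benchmark: compress and decompress a ~130-byte log line
// 10,000 times using java.util.zip (ZLIB/DEFLATE), entirely in memory.
public class TinyMessageZlibBench {
    public static void main(String[] args) throws Exception {
        byte[] plain = ("2016-11-09 20:56:02,469 ERROR [main] c.s.HelloExample - "
                + "Something wrong with customer 'CUS-123e4567-e89b-12d3-a456-42665544'")
                .getBytes("UTF-8");

        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            // Compress into an in-memory buffer (zlib-wrapped DEFLATE, default level).
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            DeflaterOutputStream out = new DeflaterOutputStream(compressed, new Deflater());
            out.write(plain);
            out.close();

            // Decompress it again and discard the output.
            InflaterInputStream in =
                    new InflaterInputStream(new ByteArrayInputStream(compressed.toByteArray()));
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) { /* discard */ }
            in.close();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("10,000 compress+decompress round trips took " + elapsedMs + " ms");
    }
}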

Birchlabs

You don't need to actually compress in order to generate a valid zlib or gzip stream. The deflate compressed data format used by both has a stored mode for incompressible data. You can easily write this format yourself without even using zlib. Though you will still want to use an integrity check routine from zlib, i.e. adler32() for the zlib format or crc32() for the gzip format.

The zlib header can be 0x78 0x01. Then you write stored blocks, each of which consists of 00, followed by the two-byte length in little-endian order, then the one's complement of those same two bytes, and then length bytes of stored data. The last stored block starts with a 01 instead of a 00. Follow that with an Adler-32 check of the uncompressed data, four bytes in big-endian order. Done.

Example valid zlib stream encoding 123456789, in hex:

78 01 01 09 00 f6 ff 31 32 33 34 35 36 37 38 39 09 1e 01 de
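For illustration, here is a minimal sketch in Java of writing such a stream, following the layout described above. It uses java.util.zip.Adler32 for the check value; the class and method names are my own, not part of zlib, gelfd, or logstash, and it assumes the payload fits in a single stored block (at most 65535 bytes). Running it on "123456789" should reproduce the hex stream shown above.

import java.io.ByteArrayOutputStream;
import java.util.zip.Adler32;

// Sketch: wrap raw bytes in a zlib stream using a single stored (uncompressed) block.
public class StoredZlib {
    static byte[] wrap(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x78); out.write(0x01);                        // zlib header
        out.write(0x01);                                          // final stored block (BFINAL=1, BTYPE=00)
        int len = data.length;
        out.write(len & 0xff); out.write((len >> 8) & 0xff);      // LEN, little-endian
        out.write(~len & 0xff); out.write((~len >> 8) & 0xff);    // NLEN = one's complement of LEN
        out.write(data, 0, len);                                  // the raw payload, stored verbatim
        Adler32 adler = new Adler32();
        adler.update(data, 0, len);
        long check = adler.getValue();                            // Adler-32 of the uncompressed data
        out.write((int) (check >>> 24) & 0xff);                   // check value, big-endian
        out.write((int) (check >>> 16) & 0xff);
        out.write((int) (check >>> 8) & 0xff);
        out.write((int) check & 0xff);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : wrap("123456789".getBytes())) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // expected: 78 01 01 09 00 f6 ff 31 32 33 34 35 36 37 38 39 09 1e 01 de
    }
}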
Mark Adler
  • I hadn't thought of that! Brilliant. Sadly it requires me to control how `log4j2` constructs "compressed" GELF messages, and also requires me to control how `logstash` parses those "compressed" GELF messages. This workaround would be harder to implement than fixing the actual problem with `gelfd`. :) – Birchlabs Nov 10 '16 at 00:51