55

Are there any lossless compression methods that can be applied to floating point time-series data, and will significantly outperform, say, writing the data as binary into a file and running it through gzip?

Reduction of precision might be acceptable, but it must happen in a controlled way (i.e. I must be able to set a bound on how many digits are kept).

I am working with some large data files which are series of correlated doubles, describing a function of time (i.e. the values are correlated). I don't generally need the full double precision but I might need more than float.

Since there are specialized lossless methods for images/audio, I was wondering if anything specialized exists for this situation.

Clarification: I am looking for existing practical tools rather than a paper describing how to implement something like this. Something comparable to gzip in speed would be excellent.

Zword
Szabolcs
  • What are you going to do with the data? Are you transferring it? Storing for awhile before use? Just trying to use less memory? Or are you specifically looking for a compact way to store time-series data? – zdav Dec 25 '11 at 17:28
  • @zdav Several reasons, but I'm not sure that this'll help with all of those. 1. I'm low on memory, it would help to store the unused bits in-memory compressed, while uncompressing only those that I'm processing. 2. Compression can speed up reading/writing from/to disk (it did for me with huge text files and gzip) 3. Yes, I'm storing data compressed on disk (gzip now). – Szabolcs Dec 25 '11 at 17:37
  • @zdav Right now most of my data is almost periodic, and gzip gives me a 4x compression. I am sure a lot better is possible because of the high correlations in the data. – Szabolcs Dec 25 '11 at 17:38
  • 7
    You say, "lossless", but you also say "reduction of precision is acceptable." But, reduction of precision _is_ loss. – Solomon Slow Feb 10 '14 at 23:34
  • @jameslarge I also said that it must be done in a controlled way, i.e. set a bound on it. I do not need all 15 digits of a double precision value in all applications, but I need to be able to set a guarantee of preserving e.g. *at least* 6 digits, or something similar. Does this clarify the question? Let's not get stuck arguing about irrelevant details, I think the question is quite clear in what I am looking for. A solution that doesn't reduce precision at all is acceptable for my needs. A solution that doesn't reduce it below a settable threshold is also acceptable. – Szabolcs Feb 11 '14 at 18:01
  • If you know beforehand how much precision you need to retain, you might try to convert the raw values to a string representation. A quick test using VB.Net (with random values) shows it compresses down to 45% instead of 150%. YMMV – Stephan B Feb 12 '14 at 10:33
  • 1
    Reducing precision in a controlled way _is_ "lossy compression." "Lossless" means that compressing and then uncompressing yields exactly the same data, bit-for-bit, as the original. If the result is only an approximation (usually degraded in some controlled/bounded way), then that's "lossy compression." – Solomon Slow Feb 13 '14 at 18:52
  • 3
    Can you leave aside the hair splitting @jameslarge ? It adds nothing to the discussion. This type of "I have to be right" compulsive arguing doesn't make StackOverflow a better place. Let's focus on solving the problem, shall we? – Szabolcs Feb 13 '14 at 18:56
  • Was wondering if you are using C# or C++ and if you looked at LZ4? I have been using an LZ4 algorithm in C#, which does not provide the highest level of compression for our data, but seems to be the fastest compression and decompression for the level of compression it does support. – Frank Hileman Feb 15 '14 at 01:23
  • The dialogue about lossless vs. "reduction of precision might be acceptable" is strange; I get the feeling james is mis-parsing the question or something, rather than that it's an ego thing. I read the question as "are there any lossless... or, I might be interested in lossy solutions as well". I think that was the intent, and it's clear and non-problematic to me, but perhaps it would help if you just prepended "Or," to the second sentence: "Or, reduction of precision might be acceptable, ..." and perhaps include the word lossy as well. Or, you could just have fun yelling at james :-) – Don Hatch Jul 03 '14 at 02:36
  • 5
    I see you asked a similar question over on scicomp and got some excellent answers, so I wanted to make sure there's a reference to that here: http://scicomp.stackexchange.com/questions/1671/compressing-floating-point-data – Don Hatch Jul 03 '14 at 02:44

8 Answers

28

Here are some ideas if you want to create your own simple algorithm (a minimal sketch follows the list):

  • Use XOR of the current value with the previous value to obtain a set of bits describing the difference.
  • Divide this difference into two parts: one part is the "mantissa bits" and one part is the "exponent bits".
  • Use variable length encoding (different number of bits/bytes per value), or any compression method you choose, to save these differences. You might use separate streams for mantissas and exponents, since mantissas have more bits to compress.
  • This may not work well if you are interleaving values from two different time-series sources. So you may have to compress each source into a separate stream or block.
  • To lose precision, you can drop the least significant bits or bytes from the mantissa, while leaving the exponent intact.
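
For illustration, here is a minimal Python/numpy sketch of the XOR-delta idea (a rough sketch only, not a full codec: it XORs consecutive doubles and hands the result to zlib, leaving out the separate mantissa/exponent streams and the variable-length coding):

import zlib
import numpy as np

def xor_delta_compress(values):
    # Reinterpret the doubles as their raw 64-bit patterns.
    bits = np.ascontiguousarray(values, dtype=np.float64).view(np.uint64)
    deltas = bits.copy()
    deltas[1:] ^= bits[:-1]            # XOR each value with its predecessor
    return zlib.compress(deltas.tobytes())

def xor_delta_decompress(blob):
    deltas = np.frombuffer(zlib.decompress(blob), dtype=np.uint64).copy()
    for i in range(1, len(deltas)):
        deltas[i] ^= deltas[i - 1]     # undo the XOR chain
    return deltas.view(np.float64)

data = np.sin(np.linspace(0, 100, 100_000))            # smooth, correlated series
assert np.array_equal(xor_delta_decompress(xor_delta_compress(data)), data)

Since consecutive values are correlated, their bit patterns share the high-order bits, so the XOR deltas contain long runs of zeros that a generic byte compressor can exploit.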
Frank Hileman
26

You might want to have a look at these resources:

You might also want to try Logluv-compressed TIFF for this, though I haven't used it myself.

Bobrovsky
7

Since you state that you need a precision somewhere between 'float' and 'double': you can zero out any number of least significant bits in single- and double-precision floats. IEEE-754 floating-point numbers are represented in binary roughly as seeefffffffff, which stands for the value

sign*1.fffffff*2^(eee).

You can zero out the least significant fraction (f) bits. For single-precision (32-bit) floats, there are 23 fraction bits of which you can zero out up to 22. For double-precision (64-bit), it's 52 and up to 51. (If you zero out all bits, then the special values NaN and +/-inf will be lost).

Especially if the data represents decimal values such as 1.2345, this will help in data compression. That is because 1.2345 cannot be represented exactly as a binary floating point value, but rather as 0x3ff3c083126e978d, which is not friendly to data compression. Chopping off the least significant 24 bits will result in 0x3ff3c08312000000, which is still accurate to about 9 decimal digits (in this example, the difference is 1.6e-9).

If you do this on the raw data and then store the differences between subsequent numbers, the result will be even more compression-friendly (via gzip) if the raw data varies slowly.

Here is an example in C:

#include <inttypes.h>
#include <string.h>

// Zero out the 'zerobits' least significant mantissa bits of x.
// memcpy is used for the bit reinterpretation to avoid strict-aliasing issues.
double double_trunc(double x, int zerobits)
{
  // mask is e.g. 0xffffffffffff0000 for zerobits==16
  uint64_t mask = ~((UINT64_C(1) << zerobits) - 1);
  uint64_t floatbits;
  memcpy(&floatbits, &x, sizeof floatbits);
  floatbits &= mask;
  memcpy(&x, &floatbits, sizeof x);
  return x;
}

And here is one in Python/NumPy:

import numpy as np

def float_trunc(a, zerobits):
    """Set the least significant <zerobits> bits to zero in a numpy
    float32/float64 (or complex64/complex128) array.
    Do this in-place. Also return the updated array.
    Maximum values of 'zerobits': 51 for float64; 22 for float32.
    """
    at = a.dtype
    assert at == np.float64 or at == np.float32 or at == np.complex128 or at == np.complex64
    if at == np.float64 or at == np.complex128:
        assert zerobits <= 51
        mask = 0xffffffffffffffff - (1 << zerobits) + 1
        bits = a.view(np.uint64)
        bits &= mask
    elif at == np.float32 or at == np.complex64:
        assert zerobits <= 22
        mask = 0xffffffff - (1 << zerobits) + 1
        bits = a.view(np.uint32)
        bits &= mask
    return a
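
As a quick illustrative check, one can compare the zlib-compressed size of a slowly varying series before and after truncation (a rough sketch; the exact gain depends on the data):

import zlib
import numpy as np

data = np.cumsum(np.random.randn(100_000))              # slowly varying float64 series
plain = zlib.compress(data.tobytes(), 9)
truncated = zlib.compress(float_trunc(data.copy(), 24).tobytes(), 9)
print(len(plain), len(truncated))                       # the truncated stream typically compresses much better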
Han-Kwang Nienhuys
7

Since you're asking for existing tools, maybe zfp will do the trick.
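
A tiny usage sketch, assuming the zfpy Python bindings of zfp (with compress_numpy/decompress_numpy and the fixed-accuracy 'tolerance' mode as described in the zfp documentation; check the API against the version you install):

import numpy as np
import zfpy  # Python bindings shipped with zfp

data = np.sin(np.linspace(0, 100, 100_000))               # correlated float64 series
compressed = zfpy.compress_numpy(data, tolerance=1e-6)    # bounded absolute error
restored = zfpy.decompress_numpy(compressed)
assert np.max(np.abs(restored - data)) <= 1e-6

The tolerance knob gives the kind of controlled precision bound asked for in the question.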

primfaktor
6

One technique that the HDF5 people use is "shuffling", where you group the bytes of N floating-point values together by byte position (all first bytes, then all second bytes, and so on). This is more likely to give you repetitive sequences of bytes, which will compress better with gzip, for example.
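
For illustration, a minimal numpy sketch of the shuffle (not HDF5's actual filter implementation, just the same byte regrouping done by hand):

import zlib
import numpy as np

def shuffle_bytes(a):
    # Regroup so that byte 0 of every value comes first, then byte 1, and so on.
    arr = np.ascontiguousarray(a)
    return arr.view(np.uint8).reshape(-1, arr.dtype.itemsize).T.copy()

data = np.cumsum(np.random.randn(100_000))              # correlated float64 series
plain = zlib.compress(data.tobytes(), 9)
shuffled = zlib.compress(shuffle_bytes(data).tobytes(), 9)
print(len(plain), len(shuffled))                        # shuffled bytes usually compress better

Because neighbouring values share their exponent and high mantissa bytes, grouping those bytes together produces the long repetitive runs that gzip/zlib handle well.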

A second method I have found, which greatly reduces the size of the gzipped data, is to first convert the data to the float16 (half-precision) format and back again to float32. This produces a lot of zero bits in the output stream, which can shrink file sizes by around 40-60 per cent after compression. One subtlety is that the maximum float16 value is rather low, so you may want to scale your data first, e.g. in Python:

import numpy as np
import math

input = np.array(...)

# format can only hold 65504 maximum, so we scale input data
log2max = int(math.log(np.nanmax(input), 2))
scale = 2**(log2max - 14)
scaled = input * (1./scale)

# do the conversion to float16
temp_float16 = np.array(scaled, dtype=np.float16)
# convert back again and rescale
output = np.array(temp_float16, dtype=np.float32) * scale

Some tests suggest that the mean absolute fractional difference between input and output for some data is around 0.00019, with a maximum of 0.00048. This is in line with the float16 mantissa precision of one part in 2**11.

xioxox
  • 1
    You might want to post this here as well: http://scicomp.stackexchange.com/questions/1671/compressing-floating-point-data It looks like an easy and practical solution. – Szabolcs Feb 04 '16 at 11:31
5

Possible methods that can be used for floating-point compression:

You can test all of these methods on your own data using the icapp tool for Linux and Windows.

powturbo
1

You can use Holt's exponential smoothing algorithm (a prediction-based compression approach). Initially, assign some weights to the data and predict the next value. If the predicted and actual values are close, XORing them produces many zeros in the most significant bits.
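
For illustration, a rough sketch of that predictor idea (the smoothing weights alpha and beta below are arbitrary example values): forecast each sample with Holt's level-plus-trend update, XOR the actual bits with the predicted bits, and feed the residuals, which carry many leading zero bits when the prediction is close, into gzip or an entropy coder:

import numpy as np

def _bits(v):
    # Reinterpret a float as its raw 64-bit pattern.
    return np.array([v], dtype=np.float64).view(np.uint64)[0]

def holt_xor_residuals(x, alpha=0.5, beta=0.3):
    x = np.asarray(x, dtype=np.float64)
    level, trend = x[0], 0.0
    residuals = np.empty(len(x), dtype=np.uint64)
    residuals[0] = _bits(x[0])                         # first value stored verbatim
    for i in range(1, len(x)):
        predicted = level + trend                      # Holt forecast for step i
        residuals[i] = _bits(x[i]) ^ _bits(predicted)  # mostly zero bits if the forecast is close
        new_level = alpha * x[i] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return residuals

A decoder can reverse this, since each prediction depends only on already-decoded values.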

Giriraj
0

I just found this link that focuses on compressing FLOAT32: https://computing.llnl.gov/projects/floating-point-compression

My old answer below remains general and applies to any kind of data.

tar with zstd seems to be the fastest and the "compressiest" compression combination out there. You can use it with a single command:

tar --use-compress-program=zstd -cvf NameOFCompressedFile.tar.zst ./Files2BeCompressed   

For unzipping the files, use the command:

tar --use-compress-program=zstd -xvf NameOFCompressedFile.tar.zst

And to list the contents of the file without decompressing use the command:

tar --use-compress-program=zstd --list --verbose --file=NameOFCompressedFile.tar.zst

The zstd algorithm only works on UNIX operating systems as far as I know. You can find more info about it here: https://github.com/centminmod/tar-zstd/blob/master/readme.md