
I have a type of data that is output as ~28 million integers ranging from 0 to 4095 (it technically comes out of the hardware as signed 16-bit integers ranging from 0 to 2^15, but that representation is needlessly precise). In principle, therefore, each datapoint's value can be represented by 12 bits - a byte and a nybble, if you will. In the long term I'll be dealing with moderately large volumes of this data (tens of terabytes), which I intend to store as binaries, so obviously losslessly compressing it to 75% of its size would be welcome.

Obviously I could just write a function that encodes my data into booleans and back and use NumPy's binary handling functions to parse it. However, I have to balance this against ease and speed of storage and retrieval. So I wonder whether there's an existing package, algorithm, etc. that accomplishes this in a simple and efficient way. I can work with Fortran or C if I need to, so writing a module in one of those is an option, but my colleagues would prefer that I didn't.
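
For concreteness, this is roughly what I mean by the boolean route - a minimal sketch using NumPy's unpackbits/packbits, untested at scale (the pack12/unpack12 names are just for illustration):

```python
import numpy as np

def pack12(values):
    """Pack integers in 0..4095 into 12 bits each via a bit matrix."""
    # One row of 16 bits per value, most significant bit first
    bits = np.unpackbits(np.asarray(values).astype('>u2').view(np.uint8)).reshape(-1, 16)
    # Keep only the low 12 bits (columns 4..15) and re-pack them into bytes
    return np.packbits(bits[:, 4:])

def unpack12(packed, count):
    """Inverse of pack12: recover `count` 12-bit values as uint16."""
    bits = np.unpackbits(packed)[:count * 12].reshape(-1, 12)
    # Pad each row back out to 16 bits and reinterpret as big-endian uint16
    full = np.zeros((count, 16), dtype=np.uint8)
    full[:, 4:] = bits
    return np.packbits(full).view('>u2').astype(np.uint16)
```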

LC Nielsen
  • Do some experiments and try gzip, bzip, lzw, etc. and see what you get. Don't make your files too big - roll over to the next file often, or else you'll take forever accessing the part you want. And don't forget to make an off-site copy if it's important. – Mark Setchell Nov 04 '19 at 18:33
  • You might find [How to create objects of arbitrary memory size?](https://stackoverflow.com/questions/29894071/how-to-create-objects-of-arbitrary-memory-size) helpful. – martineau Nov 04 '19 at 18:48
  • Does your data correspond to something intelligible, like 7,000 seconds of recording at 4 kHz to make 28m samples? If so, you could maybe store it as a 7000x4000 PNG file, which compresses 16-bit data losslessly and is instantly visualisable. Or a 16-bit PGM file, which is also visualisable and can be gzipped too (a minimal sketch of that is below these comments). It could then also be self-describing, because the width and height are encoded in the file. – Mark Setchell Nov 04 '19 at 18:50
  • how random is your data? The less random it is, the more effective any gzip-like compression algorithm will be. No need to change the format if you compress the data anyway. – Walter Tross Nov 04 '19 at 19:03
  • The data consists of units of 7×2048^2 intensities from a CCD, so it is effectively a set of images. The PNG option is something I will definitely look at, though I want to try a few different approaches; this won't be the last big dataset I deal with, so I might as well play around with an "easy" dataset. – LC Nielsen Nov 04 '19 at 19:31
  • the answer you accepted is not of much use for this kind of data – Walter Tross Nov 04 '19 at 21:45
  • I wrote an answer on how to do this in the opposite direction, which may be helpful for your problem (2.2 GB/s): https://stackoverflow.com/a/45070947/4045774 But if your main concern is compression, using a fast compression package like Blosc may be far superior: https://stackoverflow.com/a/56761075/4045774 – max9111 Nov 05 '19 at 11:56
  • Walter: I'm not just looking for how to solve this exact problem with this exact data but also for ideas and input on the various ways one can approach this type of problem. Max: thanks! Yes, often ready-made solutions are more optimized than anything one can think of, but I like to try things out, compare and contrast to better understand the problem and its solutions. – LC Nielsen Nov 05 '19 at 15:32
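
A minimal sketch of the 16-bit PGM idea from the comments, writing the header by hand per the PGM spec (the 2048×2048 frame and the filename are just placeholders):

```python
import numpy as np

def write_pgm16(path, image):
    """Write a 2-D uint16 array as a binary 16-bit PGM (big-endian samples)."""
    height, width = image.shape
    with open(path, 'wb') as f:
        f.write(f"P5\n{width} {height}\n65535\n".encode('ascii'))
        f.write(image.astype('>u2').tobytes())

# e.g. one 2048x2048 CCD frame (random stand-in data)
frame = np.random.randint(0, 4096, size=(2048, 2048), dtype=np.uint16)
write_pgm16("frame.pgm", frame)
```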

2 Answers


Have you tried [gzip.compress()](https://docs.python.org/3/library/gzip.html)?

It's not specialized for the particular task you're doing, but if you have something as simple as a series of bytes where a subset of the bits are always zeroes, I'd expect gzip to handle that about as well as a specialized compression algorithm would, and it has the benefit of being a universal format that other tools will have no problem reading.
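
For example (purely illustrative; random data like this is roughly gzip's worst case, so real CCD frames with structure should compress noticeably better):

```python
import gzip
import numpy as np

# Stand-in data: a million values in 0..4095 stored as uint16
samples = np.random.randint(0, 4096, size=1_000_000, dtype=np.uint16)

raw = samples.tobytes()
compressed = gzip.compress(raw, compresslevel=6)
print(f"{len(raw)} -> {len(compressed)} bytes ({len(compressed) / len(raw):.1%})")

# Lossless round trip
restored = np.frombuffer(gzip.decompress(compressed), dtype=np.uint16)
assert np.array_equal(restored, samples)
```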

Samwise
  • Nope, but I definitely will! I will be dealing with a lot of big datasets, many of which are much more complex, so I thought I'd ask for ideas and start playing around with the various options with a very simple one. – LC Nielsen Nov 04 '19 at 19:36
  • @LCNielsen: I expect that with compressors interpreting the input as a sequence of bytes, any prior packing "straddling byte boundaries" is harmful. – greybeard Dec 07 '19 at 03:14

Could you pack two 12-bit integers into an array of three bytes (24 bits), and then use bit shifting to get the upper and lower 12 bits back out?

I imagine such an encoding would also compress well on top of the space savings from packing, given any redundancy - for instance, if your data are particularly sparse or the integers are distributed in a certain way.

I don't know a ton about numpy, but from a cursory look I believe it can store arrays of bytes, and there are bit shifting operators available in Python. If performance is a requirement, one could look into Cython for C-based bit operations on an unsigned char * within Python.
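
For instance, a rough sketch of that packing with NumPy shifts, assuming an even number of values, each below 4096 (pack_pairs/unpack_pairs are just illustrative names):

```python
import numpy as np

def pack_pairs(values):
    """Pack pairs of 12-bit values (even count, each < 4096) into 3 bytes per pair."""
    a = np.asarray(values, dtype=np.uint16).reshape(-1, 2)
    out = np.empty((len(a), 3), dtype=np.uint8)
    out[:, 0] = a[:, 0] >> 4                             # top 8 bits of the first value
    out[:, 1] = ((a[:, 0] & 0xF) << 4) | (a[:, 1] >> 8)  # low nybble of first, top nybble of second
    out[:, 2] = a[:, 1] & 0xFF                           # low 8 bits of the second value
    return out.ravel()

def unpack_pairs(packed):
    """Inverse of pack_pairs: recover the original uint16 values."""
    b = np.asarray(packed, dtype=np.uint16).reshape(-1, 3)
    first = (b[:, 0] << 4) | (b[:, 1] >> 4)
    second = ((b[:, 1] & 0xF) << 8) | b[:, 2]
    return np.column_stack((first, second)).ravel()

# Round-trip check on toy data
vals = np.array([4095, 0, 1234, 567], dtype=np.uint16)
assert np.array_equal(unpack_pairs(pack_pairs(vals)), vals)
```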

You'd need to figure out offsets, so that you always start reading on the correct byte, or you'd get something like a "frameshift mutation", to use a biological metaphor. That could be bad news for TB-sized data containers.

Alex Reynolds
  • That's a good idea to test. It would be interesting to see if 24-bit arrays compress better than if I just stored it as 16-bit integers, which are effectively sequences of 12 bits and 4 zeroes. – LC Nielsen Nov 04 '19 at 19:35
  • Encoding to 24 bits would remove those zeroes. You'd gain compression benefits from whatever redundancy is left in the series of three-byte chunks, effectively. Encoding would help you most of all if a lot of your data are sparse or repetitive. Just be careful about data validation, given the overall size of your data. Maybe use a checksum or hash of blocks of three-byte chunks, so that you can validate your data is correct when extracting it for downstream work (a minimal sketch of that is below these comments). – Alex Reynolds Nov 04 '19 at 19:59
  • Yes, my plan is to pack it into ~44 or 32 MB bunches (depending on how much I process it), possibly several bunches spliced into a single block of data depending on organization, with maybe 48 bytes of metadata and checksums at regular intervals. – LC Nielsen Nov 04 '19 at 20:23
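
A minimal sketch of the per-block checksum idea from the comment above (SHA-256 and the block size are arbitrary choices for illustration):

```python
import hashlib

BLOCK = 3 * 1024 * 1024  # placeholder block size: 1M three-byte groups

def block_digests(packed_bytes):
    """Return one SHA-256 digest per BLOCK-sized slice of the packed byte stream."""
    return [hashlib.sha256(packed_bytes[i:i + BLOCK]).digest()
            for i in range(0, len(packed_bytes), BLOCK)]
```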