I am trying to compress some .npy files as tightly as possible. What I have read is that typically to do this you use the LZMA algorithm.

So far I have tried tar with xz at compression level 9, and Python's lzma module. Both seem effective, but I was wondering whether anybody has tried something better. Is LZMA really the best algorithm, or is there something better? I am optimizing SOLELY for compression ratio; time to compress is a non-issue. I also recognize that .npy is already more compressed than, for example, an image, so there is a limit to the optimality of the result.

I am dealing with both folders of npy files and single npy files alone.

Edit: The .npy files contain hyperspectral images from the Harvard Real World Hyperspectral Image Dataset stacked together

Mira Welner
  • "Already more compressed than an image": Um, no. "Image" is a type of data, "npy" is a general way of saving numerical data to disk, so that comparison doesn't really hold – "image" and "npy file" are simply from two different categories. But: we can help you compress better! How well something compresses depends on how well your compression algorithm can represent your data. So, what's in your .npy file? Find a way to represent them more efficiently and you're all set :) There's no general way to compress npy files – if the bits in them are truly random uncorrelated bits, it's mathematically – Marcus Müller May 30 '23 at 21:03
  • What sort of arrays are you storing? dtype, shape, proportion of 0's, etc. Apart from a small header block, most of the `npy` file is a copy of the array's data buffer. For an array full of floats or differing ints, there won't be much of a 'regular pattern' that can be compressed out. This is different from text or images with lots of "blank" space. – hpaulj May 30 '23 at 21:04
  • impossible to compress them. Luckily, that's probably not the case for your data. The better you can describe the data, the better you can find a method for squeezing out all the bits that don't actually contribute to the information. This is the principle of *source coding* :) – Marcus Müller May 30 '23 at 21:04
  • Note that this is probably *not* a programming question, but, depending on what kind of data we're talking about, might be a computer science or signal processing question. If you can describe the data in these files as a signal, try asking with as much description as you can spare on our sister site https://dsp.stackexchange.com ; there's a great lot of experience in signal compression over there. – Marcus Müller May 30 '23 at 21:05
  • Could you provide a sample of those files? – Zerte Jul 15 '23 at 09:41

1 Answer

You should start by using your knowledge of the content of your NumPy arrays before they are even stored in an npy file. Are all of the bits in the hyperspectral image data significant? For example, if they consist of 64-bit floating point numbers, then almost certainly not. In that case half or more of what you're saving is noise, which can't be compressed, and which also is not useful. You could first transform them to keep only the significant bits. Are adjacent pixels correlated? Almost certainly so. You could store only the differences between pixels, each one from its neighbor to the left, or use a more sophisticated filter based on the pixels above and to the left (see the PNG format for examples), or you could use a 2D cosine or wavelet transform to isolate higher frequency components, which you may be able to store in fewer bits. Are planes of the images correlated? Are successive images correlated? Anything else you can take advantage of?
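As a rough illustration of the "store only differences" idea, here is a minimal sketch. It assumes the pixels are unsigned 16-bit integers and that neighboring pixels along the last axis are correlated; the synthetic random-walk array stands in for real data, and the helper names `delta_encode`, `delta_decode` and `lzma_size` are made up for this example. Whether such a filter helps on the actual hyperspectral images has to be measured.

```python
import lzma

import numpy as np


def delta_encode(arr: np.ndarray) -> np.ndarray:
    """Store each pixel as its difference from the left neighbour.

    For unsigned integer dtypes the subtraction wraps around, which keeps
    the filter lossless.
    """
    out = arr.copy()
    out[..., 1:] = arr[..., 1:] - arr[..., :-1]
    return out


def delta_decode(diff: np.ndarray) -> np.ndarray:
    """Invert delta_encode with a wrapping cumulative sum along the last axis."""
    return np.cumsum(diff, axis=-1, dtype=diff.dtype)


def lzma_size(arr: np.ndarray) -> int:
    """Size in bytes of the array's raw buffer after LZMA at the highest preset."""
    return len(lzma.compress(arr.tobytes(), preset=9 | lzma.PRESET_EXTREME))


# Synthetic stand-in (an assumption): each row is a slow random walk,
# so horizontally adjacent pixels differ by only a few counts.
rng = np.random.default_rng(0)
steps = rng.integers(-3, 5, size=(64, 256))
image = (30000 + np.cumsum(steps, axis=1)).astype(np.uint16)

filtered = delta_encode(image)
assert np.array_equal(delta_decode(filtered), image)  # the filter is lossless

print("raw bytes:      ", image.nbytes)
print("LZMA, no filter:", lzma_size(image))
print("LZMA, delta:    ", lzma_size(filtered))
```

The same pattern extends to differences along the spectral axis or between successive images; only the axis of the subtraction and of the cumulative sum changes.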

Then you can try to apply various lossless compression methods. Make sure that the format you save them in is not compressed. (As far as I can tell, .npy is not compressed, but .npz is.) Beyond LZMA, you can try PPMd, which may or may not give you better performance. But it will definitely meet your requirement of being slower!
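To compare general-purpose compressors on the same bytes, a small harness like the following may be enough. It is only a sketch using the codecs in Python's standard library (zlib, bz2, lzma); PPMd is not in the standard library, so it is left out here. The function name `compressed_sizes` and the sample buffer are placeholders for this example.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for each standard-library codec."""
    return {
        "zlib -9": len(zlib.compress(data, 9)),
        "bz2 -9": len(bz2.compress(data, 9)),
        "lzma -9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }


# Placeholder input; for a real test, read the bytes of an uncompressed .npy file.
sample = bytes(range(256)) * 64

# Print the codecs from smallest to largest output.
for name, size in sorted(compressed_sizes(sample).items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: {size} bytes")
```

For a real file, `sample` could be replaced with something like `open("stack.npy", "rb").read()`, where `stack.npy` is a hypothetical filename.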

Mark Adler
  • `.npz` is a zip archive of `npy` files. It may be compressed; that's done with `zipfile.ZIP_DEFLATED` (see the docs for `savez_compressed`). – hpaulj May 31 '23 at 02:40
  • And? My point is, don't do that if you want to try other compression methods. – Mark Adler May 31 '23 at 02:46