
Say I have a Torch tensor of integers in a small range 0,...,R (e.g., R=31).

I want to store it to disk in compressed form, in a way that is close to the entropy of the vector.

The compression techniques I know (e.g., Huffman and arithmetic coding) all seem to be serial in nature.

  • Is there a fast Torch entropy compression implementation?

I'm happy to use an off-the-shelf implementation, but I can also try to implement it myself if someone knows a suitable algorithm.

M A
  • Would you please clarify your range notation? Is it really just integers between (inclusively) `0` and `31`? And please add information about the size of your tensor. – Wolf Aug 11 '21 at 10:47
  • Why do you ask for the “fastest way” instead of specifying what is fast enough for you? – Wolf Aug 11 '21 at 17:03

2 Answers


torch.save stores the tensor using the pickle protocol. If you want to save space, quantizing these vectors before saving should help.

Also, you can try zlib module:

https://github.com/jonathantompson/torchzlib

One alternative is to transform it to numpy arrays and then use some of the compression methods available there.

Refer to:

Compress numpy arrays efficiently
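For example, a minimal sketch of the NumPy route (my own illustration, assuming the tensor has been moved to the CPU, e.g. with `t.cpu().numpy()`):

```python
import io
import numpy as np

# Toy data standing in for the tensor: integers in 0..31, one per byte.
arr = np.random.randint(0, 32, size=1 << 16).astype(np.uint8)

# np.savez_compressed writes a zip (deflate) archive; np.load reads it back.
buf = io.BytesIO()
np.savez_compressed(buf, data=arr)
buf.seek(0)
restored = np.load(buf)["data"]

assert np.array_equal(arr, restored)
print(f"{arr.nbytes} bytes raw -> {buf.getbuffer().nbytes} bytes compressed")
```

In practice you would pass a filename instead of the `BytesIO` buffer; the buffer just makes the round trip self-contained.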

Guinther Kovalski
  • 1
    Thanks, but wouldn't the numpy compressions run serially and not leverage the GPU? It's not true that in general you cannot save space compared to pickle, if you are willing to run a compression algorithm. The problem is that the ones I know are too slow. – M A Aug 10 '21 at 18:38
  • 1
    I/O operations are very unlikely to use any GPU processing. One possible path is to run the compression algorithm before save the vector. Quantization is one very common compressing technique available in pytorch. It will only convert double floats to floats or integers, keeping the most of the original information. https://pytorch.org/docs/stable/quantization.html – Guinther Kovalski Aug 10 '21 at 18:45
  • Indeed, I want to compress the vector before saving it. I think that these quantizations do not look at the entropy of the vector and potentially introduce an error. I'm looking for lossless compression. – M A Aug 10 '21 at 20:25
  • Correct me if I'm wrong, but given that you have a bunch of "small range" tensors, you can use multiple threads to parallelize the process of open -> compress -> save. – Guinther Kovalski Aug 10 '21 at 20:48
  • They come one at a time and I want to store it with minimal overheads. – M A Aug 10 '21 at 22:33

For what you described, you can simply pack five-bit integers into a bit stream. Packing and unpacking are easy with the bitwise shift, or, and and operators (`<<`, `>>`, `|`, `&`). That is as good as you can do if your integers are uniformly distributed in 0..31 and there are no repeated patterns.
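For example (my own sketch, not part of the original answer), a plain-Python packer and unpacker using only those operators; a production version would vectorize this with torch or numpy ops:

```python
def pack5(values):
    """Pack integers in 0..31 into bytes, 5 bits each, little-endian bit order."""
    acc = nbits = 0
    out = bytearray()
    for v in values:
        acc |= (v & 0x1F) << nbits   # append the next 5 bits to the accumulator
        nbits += 5
        while nbits >= 8:            # emit full bytes as they become available
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # flush any leftover bits
        out.append(acc & 0xFF)
    return bytes(out)

def unpack5(data, n):
    """Recover n five-bit integers from packed bytes."""
    acc = nbits = 0
    out = []
    it = iter(data)
    for _ in range(n):
        while nbits < 5:             # refill the accumulator from the stream
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & 0x1F)
        acc >>= 5
        nbits -= 5
    return out
```

1000 values pack into exactly 1000 * 5 / 8 = 625 bytes, versus 1000 bytes at one integer per byte.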

If, on the other hand, the distribution of your integers is significantly skewed or there are repeated patterns, then you should use an existing lossless compressor, such as zlib, zstd, or lzma2 (xz). For any of those, feed them one integer per byte.
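As an illustrative sketch (the data here is made up; sizes will differ on real input), feeding one integer per byte to Python's standard zlib and lzma modules:

```python
import lzma
import random
import zlib

# Toy data: a heavily skewed distribution over 0..31, so the entropy is
# well under 5 bits per symbol. Shuffled so run-length tricks don't dominate.
random.seed(0)
values = [0] * 9000 + [7] * 800 + [31] * 200
random.shuffle(values)
raw = bytes(values)          # one integer per byte, as suggested above

z = zlib.compress(raw, 9)    # deflate
x = lzma.compress(raw)       # xz container (LZMA2)
assert zlib.decompress(z) == raw
assert lzma.decompress(x) == raw
print(len(raw), len(z), len(x))
```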

To parallelize the computation, you can break up your 2^25 integers into many small subsets, each of which can be compressed independently. You could go down to a few tens of kilobytes each, likely with little loss of compression from the per-block overhead. You will need to experiment with your data.
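A sketch of that block-wise scheme (the block size and pool choice are my assumptions to tune, not recommendations from the answer); zlib releases the GIL while compressing, so even a thread pool gives real parallelism:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # bytes per block; experiment with your data

def compress_blocks(raw: bytes, level: int = 6):
    """Split raw into fixed-size blocks and compress them in parallel."""
    blocks = [raw[i:i + CHUNK] for i in range(0, len(raw), CHUNK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: zlib.compress(b, level), blocks))

def decompress_blocks(blocks):
    """Decompress each block independently and concatenate the results."""
    return b"".join(zlib.decompress(b) for b in blocks)
```

Each block carries its own deflate header and resets the match history, which is the small overhead mentioned above; at tens of kilobytes per block it is usually negligible.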

Mark Adler
  • I want to compress the vector to less than 5 bits per coordinate (its entropy is a lot lower). – M A Aug 10 '21 at 20:23
  • 1
    Then you will need to use Huffman or some other entropy coding. You should use an off-the-shelf compressor, like zlib or zstd. Feed the compressor one integer per byte. – Mark Adler Aug 10 '21 at 20:30
  • 1
    I don't know what your concern is about "serial in nature". Of course it's serial. – Mark Adler Aug 10 '21 at 20:31
  • So my question was if there's a compression algorithm that's GPU-friendly and would work faster than serial implementations of Huffman or arithmetic coding. – M A Aug 10 '21 at 22:32
  • P.S. Why is it obvious that it's serial? – M A Aug 10 '21 at 22:32
  • 2
    Because bit streams must be generated serially, since the bit position at any point depends on the data the precedes it. If you want to break up the input into blocks for parallel processing, that is totally doable. How large is the vector? – Mark Adler Aug 10 '21 at 23:18
  • 1
    About 2^25 entries. – M A Aug 11 '21 at 10:47
  • 1
    @MA Please add this information to the question. Thanks. – Wolf Aug 11 '21 at 10:48