
I have a large numpy array (188,995 values to be exact) containing 18-digit integers. Here are the first 5:

array([873205635515447425, 872488459744513265, 872556415745513809,
       872430459826834345, 867251246913838889])

The array's dtype is dtype('int64'). I'm currently storing this array in a .npy file that's 1.5 MB in size.

I'll be storing a couple of these arrays every day, and I want to be conscious of storage. If it helps, the integers are always 18 digits long. They don't have any discernible pattern, so dividing them down won't work.

I was able to decrease the file size to 1.4 MB by gzip-compressing it and storing it as a .npy.gz file, but that's the lowest it'll go.
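
Roughly what the gzip step looks like (a sketch, not my exact code; file names are placeholders):

```python
import gzip
import numpy as np

arr = np.load("values.npy")  # the 188,995-value int64 array

# write the raw .npy bytes through gzip -> values.npy.gz
with gzip.open("values.npy.gz", "wb") as f:
    np.save(f, arr)

# reading it back later
with gzip.open("values.npy.gz", "rb") as f:
    arr2 = np.load(f)
```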

Is there a way to compress the array down further?

mmz
  • Check this out https://stackoverflow.com/a/40542980/2640045. – Lukas S Sep 18 '21 at 14:39
  • Yep, saw that answer and that's exactly how I got down to 1.4 MB. Was wondering whether I was missing any other methods. – mmz Sep 18 '21 at 14:44
  • The `npy` file contains a byte image of that array, that is 8 bytes per number, same as in memory. Your numbers are close to the upper limit of what an `int64` can store (without overflow). There isn't much room for compression without loss of precision. Images and text have a lot of "redundancy" that can be compressed out; your data does not. – hpaulj Sep 18 '21 at 15:36 (see the back-of-the-envelope check after these comments)
  • Thanks @hpaulj, you're right. For any future readers: I found that converting the array and saving it as a parquet file with brotli compression gave me the lowest file size, 1.3 MB. – mmz Sep 18 '21 at 16:46 (sketched after these comments)
  • Can you provide more of these values? If possible a whole chunk. – Jérôme Richard Sep 19 '21 at 14:07
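
A back-of-the-envelope check of hpaulj's comment: 18-digit integers with no discernible pattern are drawn from [10^17, 10^18), so each one carries about log2(9·10^17) ≈ 59.6 bits ≈ 7.5 bytes of information. For 188,995 values that puts the lossless floor near 1.41 MB, which is roughly where gzip lands. A quick sketch of the arithmetic:

```python
import math

n_values = 188_995
# an 18-digit integer lies in [10**17, 10**18), i.e. 9 * 10**17 possible values
bits_per_value = math.log2(9 * 10**17)        # ~59.6 bits

raw_bytes = n_values * 8                       # int64 storage: ~1.51 MB
floor_bytes = n_values * bits_per_value / 8    # entropy floor for patternless values: ~1.41 MB

print(f"raw .npy payload: {raw_bytes / 1e6:.2f} MB")
print(f"entropy floor:    {floor_bytes / 1e6:.2f} MB")
```

And a sketch of the parquet-with-brotli route mmz settled on (the column name, `pyarrow` engine, and file names are assumptions, not the asker's actual code; brotli requires a pyarrow build that includes it):

```python
import numpy as np
import pandas as pd

arr = np.load("values.npy")

# wrap the array in a one-column DataFrame and let parquet + brotli compress it
pd.DataFrame({"values": arr}).to_parquet(
    "values.parquet", engine="pyarrow", compression="brotli"
)

# reading it back
arr2 = pd.read_parquet("values.parquet")["values"].to_numpy()
```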

0 Answers