
I have a large data set.

The best I could achieve is to use numpy arrays, make a binary blob out of them, and then compress it:

import zlib
import numpy as np

my_array = np.array([1.0, 2.0, 3.0, 4.0])
my_bytes = my_array.tobytes()          # raw binary representation of the array
compressed = zlib.compress(my_bytes)   # zlib-compress the raw bytes

With my real data, however, the compressed binary file comes out at 22MB, so I am hoping to make it even smaller. I found that, by default, 64-bit machines use float64, which takes up 24 bytes in memory: 8 bytes for the pointer to the value, 8 bytes for the double-precision number, and 8 bytes for the garbage collector. If I change it to float32 I gain a lot of memory but lose precision; I am not sure I want that. But what about the 8 bytes for the garbage collector: are they automatically stripped away?
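
A quick sketch of the float32 idea, with random data standing in for my real array (so the exact sizes will differ):

import zlib
import numpy as np

full = np.random.rand(46800, 4, 18)          # stand-in for my real data
half = full.astype(np.float32)               # half the raw bytes, ~7 significant digits

print(len(zlib.compress(full.tobytes())))    # compressed size at float64
print(len(zlib.compress(half.tobytes())))    # compressed size at float32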

Observations: I have already tried pickle, hickle, and msgpack, but 22MB is the best size I managed to reach.

    A `float64` array does not have a pointer at each element. That's literally the point of having an array. They are densely packed. – Mad Physicist Apr 14 '20 at 17:47
  • How big is your array? – Mad Physicist Apr 14 '20 at 17:47
  • 46800 entries, each containing a 4x18 matrix; it can become much bigger though – Akira Kotsugai Apr 14 '20 at 17:52
  • Is the shape `(46800, 4, 18)`? Please edit the question to include that information. 22MB is meaningless without that. – Mad Physicist Apr 14 '20 at 17:55
  • I flattened everything into one array because I gained a bit of space by doing that. So no, it is one array with enough metadata to rebuild a (46800, 4, 18) array. That is why I did not include this information in the question – Akira Kotsugai Apr 14 '20 at 18:01
  • The metadata should be no more than 3*8 + 1 = 25 bytes: ndim (1 byte), plus an 8-byte integer for each of the 3 elements of the shape. – Mad Physicist Apr 14 '20 at 18:02
  • An example, in case performance also matters: https://stackoverflow.com/a/56761075/4045774 Can you give an example of your real data, as the achievable compression ratios heavily depend on it? – max9111 Apr 15 '20 at 10:36
  • Please note that `m` is the SI prefix for `milli` and `b` is the SI unit for `bits`, so you are saying your file is 22 millibits... as opposed to 22MB or 22 megabytes. – Mark Setchell Apr 15 '20 at 14:08

1 Answer


An array of 46800 x 4 x 18 8-byte floats takes up 26956800 bytes. That's 25.7 MiB or 27.0 MB. A compressed size of 22MB is an 18% (or 14% if you really meant MiB) reduction, which is pretty good by most standards, especially for random binary data. You are unlikely to improve on that much. Using a smaller datatype like float32, or perhaps trying to represent your data as rationals, may be useful.
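
To make the arithmetic concrete, a few lines of Python reproduce the numbers above:

raw = 46800 * 4 * 18 * 8        # bytes occupied by the float64 elements
print(raw)                      # 26956800
print(raw / 2**20)              # ~25.7 (MiB)
print(raw / 10**6)              # ~27.0 (MB)
print(1 - 22e6 / raw)           # ~0.18, i.e. an 18% reduction at 22 MB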

Since you mention that you want to store metadata, you can record one byte for the number of dimensions (numpy allows at most 32 dimensions), and N integers for the size in each dimension (either 32- or 64-bit). Say you use 64-bit integers: that makes 1 + 3*8 = 25 bytes of metadata in your particular case, or under 10⁻⁴ % of the total array size.
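
A minimal sketch of such a header (the function names and layout are mine, not a standard format; it assumes little-endian 64-bit sizes):

import struct
import numpy as np

def pack(arr):
    # 1 byte for ndim, then one unsigned 64-bit size per dimension,
    # followed by the raw element bytes
    header = struct.pack('<B', arr.ndim) + struct.pack(f'<{arr.ndim}Q', *arr.shape)
    return header + arr.tobytes()

def unpack(buf, dtype=np.float64):
    ndim = struct.unpack_from('<B', buf)[0]
    shape = struct.unpack_from(f'<{ndim}Q', buf, 1)
    return np.frombuffer(buf, dtype=dtype, offset=1 + 8 * ndim).reshape(shape)

a = np.arange(46800 * 4 * 18, dtype=np.float64).reshape(46800, 4, 18)
assert np.array_equal(unpack(pack(a)), a)   # round-trips; header is 25 bytes here

You would then run the whole packed buffer through zlib.compress exactly as before; 25 bytes of header will not measurably affect the compression ratio.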
