I have a list that contains a very large NumPy array, a very small NumPy array, and a few small fields. I want to save the list to a file and load it later. If I use pickle, as described in "how to save/read class wholly in Python", the file loads quickly but is far too large on disk.

Question: What is the most space-efficient way to save such a list, assuming the file is not compressed?

An example of such a list, saved using pickle:

import pickle
import numpy as np

class MyClass:
    def __init__(self):
        self.largearray = np.random.rand(10000,10000,3) * 255
        self.smallarray = np.random.rand(100,100,3) * 255
        self.attribute1 = True
        self.attribute2 = 'Some String'
        self.attribute3 = 888
        self.list = [self.largearray, self.smallarray, self.attribute1, self.attribute2, self.attribute3]

a = MyClass()

# Save the list to disk.
with open('test.pickle', 'wb') as file:
    pickle.dump(a.list, file)

# Load it back.
with open('test.pickle', 'rb') as file2:
    a_loaded = pickle.load(file2)

Edit: As the comment points out, the size comes from the NumPy array rather than from pickle. I should convert the array to some other data structure that takes less space and can be quickly converted back to a NumPy array when loaded. What is the best structure for this?

温泽海
  • Does this help? https://stackoverflow.com/a/38068727/12229158 – whege May 18 '22 at 19:44
  • Convert the 64-bit numpy array to a 32-bit array. – DYZ May 18 '22 at 19:55
  • Look at `self.largearray.nbytes`. This should roughly match the space taken up by the array in the file. `np.save('largearray.npy', self.largearray)` is a direct way of saving just the array. Changing the `dtype` can reduce the size (`nbytes`) by a factor of 2, 4, etc. `np.save` (and, by extension, `pickle`) writes a simple copy of the array's data to the file. There isn't a more "compact" format. – hpaulj May 18 '22 at 20:10
  • This is simply not possible without a loss of accuracy or compression (assuming pickle does its job properly; otherwise, please use NumPy functions to store the array). These are the only two options. Your `largearray` takes 2.2 GiB in RAM, which is large because of its shape and because each item is a 64-bit float. You can use 32-bit floats, but this will reduce the accuracy. You can even use 16-bit floats if you do not care much about the accuracy of the result. – Jérôme Richard May 18 '22 at 20:11
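The suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not a definitive answer: it uses a much smaller stand-in array (200x200x3 instead of 10000x10000x3) so that it runs quickly, and the file name `largearray_f32.npy` and the variable names are made up for the example. It compares `nbytes` across dtypes, saves the 32-bit version with `np.save`, and measures how much accuracy the downcast costs.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the 10000x10000x3 array in the question (assumption:
# the same approach applies to the full-size array, only the sizes change).
rng = np.random.default_rng(0)
largearray = rng.random((200, 200, 3)) * 255  # float64 by default

# Lower-precision copies: fewer bytes per element, at the cost of accuracy.
as_f32 = largearray.astype(np.float32)
as_f16 = largearray.astype(np.float16)

for name, arr in [("float64", largearray), ("float32", as_f32), ("float16", as_f16)]:
    print(name, arr.nbytes, "bytes")

# np.save writes the raw array buffer plus a small header, so the file
# size should track nbytes closely.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "largearray_f32.npy")  # hypothetical file name
    np.save(path, as_f32)
    file_size = os.path.getsize(path)
    restored = np.load(path)

print("file size:", file_size, "bytes")

# Largest absolute difference introduced by the float64 -> float32 cast.
max_error = float(np.max(np.abs(restored.astype(np.float64) - largearray)))
print("max abs error after float32 round trip:", max_error)
```

Whether the lost precision is acceptable depends on the data. If the values are really 8-bit image intensities, an integer dtype such as `np.uint8` may be a better fit, but that is an assumption about the data that the question does not confirm.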

0 Answers