Memory error pickle dump while saving/loading data from/into disk

Question

I have a dataset of 40,000 examples, with shape (40000, 2048). After some processing, I would like to store and load the dataset efficiently. The dataset is a numpy array.

I used pickle to store this dataset, but it takes a long time to store and even longer to load. I even get a memory error.

I tried to split the dataset into several chunks as follows:

import pickle

with open('dataset_10000.sav', 'wb') as handle:
    pickle.dump(train_frames[:10000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_20000.sav', 'wb') as handle:
    pickle.dump(train_frames[10000:20000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_30000.sav', 'wb') as handle:
    pickle.dump(train_frames[20000:30000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_35000.sav', 'wb') as handle:
    pickle.dump(train_frames[30000:35000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_40000.sav', 'wb') as handle:
    pickle.dump(train_frames[35000:], handle, protocol=pickle.HIGHEST_PROTOCOL)

However, I still get a memory error, and it is too heavy.

What is the best/optimized way to save/load such a large dataset from/into disk?

@juanpa.arrivillaga yes as described in my first paragraph. It's a numpy array — Joseph, Feb 23 '18 at 19:36
Using HDF5. Please add how you want to read your data after writing. For example iterating over the first or the second dimension. — max9111, Feb 23 '18 at 20:54
I think this example: https://stackoverflow.com/a/48954998/4045774 should answer the question of how to read and write data in the HDF5 format. If not, feel free to ask — max9111, Feb 25 '18 at 15:34

score 1 · Answer 1 · answered Feb 23 '18 at 19:37

For numpy.ndarray objects, use numpy.save, which you should prefer over pickle anyway, since it is more portable. It should be faster and require less memory in the serialization process.
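As a minimal sketch of the save step: the array below is a small random stand-in for `train_frames` (1,000 rows instead of 40,000, and the `float32` dtype is an assumption, since the question does not state it), and the file name `dataset.npy` is only illustrative.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real data; row count and dtype are assumptions.
train_frames = np.random.rand(1000, 2048).astype(np.float32)

# Write the whole array in one call; no manual chunking is needed.
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")
np.save(path, train_frames)

# Read it back fully into memory and check the round trip.
restored = np.load(path)
print(restored.shape, restored.dtype)
```

The same two calls should work unchanged for the full (40000, 2048) array, provided the array itself fits in memory.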

You can then load it with numpy.load which even provides a memmap option, allowing you to work with arrays that are larger than can fit into memory.
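A sketch of the memory-mapped load, assuming a `.npy` file was written with `numpy.save` beforehand. The file here is a small generated placeholder and the batch bounds are arbitrary.

```python
import os
import tempfile

import numpy as np

# Create a placeholder .npy file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")
np.save(path, np.arange(1000 * 2048, dtype=np.float32).reshape(1000, 2048))

# mmap_mode="r" maps the file read-only instead of reading it all into RAM.
frames = np.load(path, mmap_mode="r")

# Slicing touches only the requested rows; np.array copies them into memory.
batch = np.array(frames[100:200])
print(type(frames).__name__, batch.shape)
```

With this pattern only the rows being sliced are read from disk, so the data can be processed batch by batch.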

juanpa.arrivillaga

Can I save it all at once? I mean all 40000 examples? And what is the file extension when using numpy.save? – Joseph Feb 23 '18 at 19:40
@Joseph yes, save all of it. You can call it whatever you want, but the extension is `.npy`, which is a `numpy`-specific binary serialization format that retains information like endianness to make your serialized data portable. Note, this is preferable to `pickle` since `pickle` would require essentially a duplication of the data in memory. – juanpa.arrivillaga Feb 23 '18 at 19:43
What about hdf5? – Joseph Feb 23 '18 at 19:46
@Joseph what about it? – juanpa.arrivillaga Feb 23 '18 at 19:47
What is the fmt option in numpy save? Is it the same for float and string? – Joseph Feb 23 '18 at 19:49
@Joseph what? There is no `fmt` option. It isn't like `np.savetxt`, which essentially creates a csv. This is a `numpy`-specific binary serialization format which allows your arrays to be reconstructed correctly on different architectures. See [here](https://docs.scipy.org/doc/numpy/neps/npy-format.html) – juanpa.arrivillaga Feb 23 '18 at 19:51
Thank you for the clarification. I am running my code with np.save; I'll let you know – Joseph Feb 23 '18 at 19:57
Can I use this to save a list of numpy arrays? – arilwan Mar 05 '20 at 10:03
