
I have a dataset of 40,000 examples with shape (40000, 2048). After some processing, I would like to store and load the dataset efficiently. The dataset is a NumPy array.

I used pickle to store this dataset, but it takes a long time to store and even longer to load. I even get a memory error.

I tried splitting the dataset into several chunks, as follows:

with open('dataset_10000.sav', 'wb') as handle:
    pickle.dump(train_frames[:10000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_20000.sav', 'wb') as handle:
    pickle.dump(train_frames[10000:20000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_30000.sav', 'wb') as handle:
    pickle.dump(train_frames[20000:30000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_35000.sav', 'wb') as handle:
    pickle.dump(train_frames[30000:35000], handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('dataset_40000.sav', 'wb') as handle:
    pickle.dump(train_frames[35000:], handle, protocol=pickle.HIGHEST_PROTOCOL)

However, I get a memory error, and it is too heavy.

What is the best way to save such a large dataset to disk and load it back?

Joseph

1 Answer


For numpy.ndarray objects, use numpy.save, which you should prefer over pickle anyway since it is more portable. It should also be faster and require less memory during serialization.

You can then load it with numpy.load, which even provides a memory-mapping option (the mmap_mode parameter), allowing you to work with arrays that are too large to fit into memory.
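As a minimal sketch of the save/load round trip described above: the array below is a small stand-in for the real (40000, 2048) data, and the float32 dtype, the temporary directory, and the file name `dataset.npy` are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real (40000, 2048) array; the dtype is an assumption.
train_frames = np.arange(400 * 2048, dtype=np.float32).reshape(400, 2048)

# Hypothetical output path; np.save appends ".npy" if the name lacks it.
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")

# Write the whole array in one call, without manual chunking.
np.save(path, train_frames)

# Regular load: reads the full array into memory.
loaded = np.load(path)

# Memory-mapped load: the data stays on disk and is read on access.
mapped = np.load(path, mmap_mode="r")

# Copy only the rows that are needed into RAM.
batch = np.asarray(mapped[100:200])
```

With `mmap_mode="r"` the returned object is a read-only `numpy.memmap`, so slicing it touches only the requested rows, which is what makes it usable when the full array does not fit in memory.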

juanpa.arrivillaga
  • Can I save everything at once, I mean all 40,000 examples? And what is the file extension when using numpy.save? – Joseph Feb 23 '18 at 19:40
  • @Joseph Yes, save all of it. You can call it whatever you want, but the extension is `.npy`, which is a `numpy`-specific binary serialization format that retains information like endianness to make your serialized data portable. Note that this is preferable to `pickle`, since `pickle` would essentially require duplicating the data in memory. – juanpa.arrivillaga Feb 23 '18 at 19:43
  • What about HDF5? – Joseph Feb 23 '18 at 19:46
  • @Joseph what about it? – juanpa.arrivillaga Feb 23 '18 at 19:47
  • What is the fmt option in numpy.save? Is it the same for floats and strings? – Joseph Feb 23 '18 at 19:49
  • @Joseph What? There is no `fmt` option. It isn't like `np.savetxt`, which essentially creates a CSV. This is a `numpy`-specific binary serialization format that allows your arrays to be reconstructed correctly on different architectures. See [here](https://docs.scipy.org/doc/numpy/neps/npy-format.html) – juanpa.arrivillaga Feb 23 '18 at 19:51
  • Thank you for the clarification. I am running my code with np.save and will let you know. – Joseph Feb 23 '18 at 19:57
  • Can I use this to save a list of numpy arrays? – arilwan Mar 05 '20 at 10:03
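On the last question, which the thread leaves unanswered: one possible approach, not suggested by the answerer, is `numpy.savez`, which stores several arrays in a single `.npz` archive. The sketch below uses a hypothetical list of arrays and a hypothetical file name.

```python
import os
import tempfile

import numpy as np

# Hypothetical list of arrays with different shapes.
arrays = [np.zeros((3, 4)), np.ones(5), np.arange(6).reshape(2, 3)]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "arrays.npz")

# Positional arguments are stored under the keys arr_0, arr_1, ...
np.savez(path, *arrays)

# np.load returns an NpzFile; each array is read when its key is accessed.
with np.load(path) as archive:
    restored = [archive[f"arr_{i}"] for i in range(len(archive.files))]
```

If all arrays in the list share the same shape, stacking them into one array and using `numpy.save` is simpler and keeps the memory-mapping option available.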