Why does pickle take so much longer than np.save?

Question

I want to save a dict of arrays.

I tried both np.save and pickle, and the former always takes much less time.

My actual data is much bigger but I just present a small piece here for demonstration purposes:

import numpy as np
import time
import pickle

b = {0: [np.array([0, 0, 0, 0])], 1: [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0]), np.array([0, 0, 0, 1]), np.array([-1,  0,  0,  0]), np.array([ 0, -1,  0,  0]), np.array([ 0,  0, -1,  0]), np.array([ 0,  0,  0, -1])], 2: [np.array([2, 0, 0, 0]), np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]), np.array([1, 0, 0, 1]), np.array([ 1, -1,  0,  0]), np.array([ 1,  0, -1,  0]), np.array([ 1,  0,  0, -1])], 3: [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0]), np.array([0, 0, 0, 1]), np.array([-1,  0,  0,  0]), np.array([ 0, -1,  0,  0]), np.array([ 0,  0, -1,  0]), np.array([ 0,  0,  0, -1])], 4: [np.array([2, 0, 0, 0]), np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]), np.array([1, 0, 0, 1]), np.array([ 1, -1,  0,  0]), np.array([ 1,  0, -1,  0]), np.array([ 1,  0,  0, -1])], 5: [np.array([0, 0, 0, 0])], 6: [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0]), np.array([0, 0, 0, 1]), np.array([-1,  0,  0,  0]), np.array([ 0, -1,  0,  0]), np.array([ 0,  0, -1,  0]), np.array([ 0,  0,  0, -1])], 2: [np.array([2, 0, 0, 0]), np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]), np.array([1, 0, 0, 1]), np.array([ 1, -1,  0,  0]), np.array([ 1,  0, -1,  0]), np.array([ 1,  0,  0, -1])], 7: [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0]), np.array([0, 0, 0, 1]), np.array([-1,  0,  0,  0]), np.array([ 0, -1,  0,  0]), np.array([ 0,  0, -1,  0]), np.array([ 0,  0,  0, -1])], 8: [np.array([2, 0, 0, 0]), np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]), np.array([1, 0, 0, 1]), np.array([ 1, -1,  0,  0]), np.array([ 1,  0, -1,  0]), np.array([ 1,  0,  0, -1])]}


start_time = time.time()
with open('testpickle', 'wb') as myfile:
    pickle.dump(b, myfile)
print("--- Time to save with pickle: %s milliseconds ---" % (1000*time.time() - 1000*start_time))

start_time = time.time()
np.save('numpy', b)
print("--- Time to save with numpy: %s milliseconds ---" % (1000*time.time() - 1000*start_time))

start_time = time.time()
with open('testpickle', 'rb') as myfile:
    g1 = pickle.load(myfile)
print("--- Time to load with pickle: %s milliseconds ---" % (1000*time.time() - 1000*start_time))

start_time = time.time()
g2 = np.load('numpy.npy', allow_pickle=True)  # newer NumPy versions refuse object arrays unless allow_pickle=True
print("--- Time to load with numpy: %s milliseconds ---" % (1000*time.time() - 1000*start_time))

which gives an output:

--- Time to save with pickle: 4.0 milliseconds ---
--- Time to save with numpy: 1.0 milliseconds ---
--- Time to load with pickle: 2.0 milliseconds ---
--- Time to load with numpy: 1.0 milliseconds ---

The time difference is even more pronounced with my actual data (~100,000 keys in the dict).

Why does pickle take longer than np.save, both for saving and for loading?

When should I use pickle?

score 6 · Accepted Answer · answered Aug 14 '18 at 09:14

Because, as long as the written object contains no Python data:

- numpy objects are represented in memory in a much simpler way than Python objects
- numpy.save is written in C
- numpy.save writes in a supersimple format that needs minimal processing

Meanwhile:

- Python objects have a lot of overhead
- pickle is written in Python
- pickle transforms the data considerably from the underlying representation in memory to the bytes being written on the disk

Note that if a numpy array does contain Python objects, then numpy just pickles the array, and all the win goes out the window.
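Whether an array falls into the "contains Python objects" case can be checked through its dtype. The following is a minimal sketch; the variable names are made up for illustration, and the exact string dtype printed for the mixed array depends on the platform.

```python
import numpy as np

plain = np.array([1, 0, 0, 0])    # fixed-size integers stored inline
mixed = np.array([1, "foo"])      # coerced to a fixed-width string dtype
boxed = np.array([{}])            # each element is a pointer to a Python object
# Roughly what np.save has to do with a dict: wrap it in a 0-d object array.
wrapped = np.asarray({0: [plain]}, dtype=object)

for name, arr in [("plain", plain), ("mixed", mixed),
                  ("boxed", boxed), ("wrapped", wrapped)]:
    # hasobject is True when the dtype stores references to Python objects
    print(name, arr.dtype, arr.shape, arr.dtype.hasobject)
```

Only arrays for which `hasobject` is False can be written in the plain binary layout; the others go through pickle.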

Amadan

By 'objects' you mean methods, functions etc? – SuperCiocia Aug 14 '18 at 09:16
I mean any value that is not a numpy value. A value that triggers `.dtype.hasobject` becoming true. For example, `np.array([1, "foo"])` is okay, `np.array([lambda x: x + 1])` and `np.array([{}])` are not. – Amadan Aug 14 '18 at 09:18
Sorry for the ignorance, but what exactly do you mean by the last 2 examples "not being okay"? And what do you mean by 'Python objects have a lot of overhead'? – SuperCiocia Aug 14 '18 at 11:45
I mean `np.array([{}]).dtype.hasobject` is `True`, and thus `np.save` would use pickle to represent it instead of its own representation, which in turn means it is actually slightly *slower* than pickle. – Amadan Aug 14 '18 at 11:48
Regarding overhead, Python is a dynamic language, and Python objects need to carry lots of extra info that C values don't (because the compiler takes care of the code knowing where and what everything is). E.g. for `a = list(range(100))`, compare `sys.getsizeof(a) + sum(sys.getsizeof(x) for x in a)` (size of the Python list in memory) vs `np.array(a).nbytes` (size of the same numpy array). If you do `np.array(a, dtype=np.uint8).nbytes`, the difference is even bigger. But if you change it to a non-numpy type (e.g. dicts), numpy stops being able to store the values in-place and stores pointers instead: e.g. for `a = [{i: i} for i in range(100)]` – Amadan Aug 14 '18 at 12:05
and you need to start counting the python sizes again: `np.array(a).nbytes + sum(sys.getsizeof(d) + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items()) for d in a)`, which is not that different (or possibly even larger?) than the corresponding Python list (whose size would be calculated exactly the same way, except for `sys.getsizeof(a)` instead of `np.array(a).nbytes`). – Amadan Aug 14 '18 at 12:07
In Py3, `cPickle` is the standard `pickle`. `save` is the `pickle` format for arrays. `save` uses `pickle.dumps` to write objects like dictionaries and lists. So with a complex structure like `b` `pickle` and `save` end up writing nearly the same things. But `save` has to first wrap the dictionary in an object array. – hpaulj Aug 14 '18 at 15:48
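The size comparison described in the comments above can be run as a short script. This is only a sketch: the variable names are illustrative, and the exact byte counts vary with the Python version and platform, so none are quoted here.

```python
import sys
import numpy as np

a = list(range(100))

# Python list: the list object itself plus every boxed int it points to
list_bytes = sys.getsizeof(a) + sum(sys.getsizeof(x) for x in a)
# numpy array: raw data buffer only, default integer dtype
array_bytes = np.array(a).nbytes
# same values packed into one byte each
small_bytes = np.array(a, dtype=np.uint8).nbytes

print("list:", list_bytes, "array:", array_bytes, "uint8 array:", small_bytes)

# With non-numpy values the array only stores pointers to Python objects,
# so the Python-side sizes have to be counted again.
dicts = [{i: i} for i in range(100)]
obj = np.array(dicts)
payload = sum(
    sys.getsizeof(d) + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())
    for d in dicts
)
print("object array:", obj.dtype, obj.nbytes, "+ payload:", payload)
```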

score 4 · Answer 2 · answered Aug 14 '18 at 15:56

I think you need better timings. I also disagree with the accepted answer.

b is a dictionary with 9 keys; the values are lists of arrays. That means both pickle.dump and np.save will be using each other - pickle uses save to pickle the arrays, save uses pickle to save the dictionary and list.

save writes arrays. That means it has to wrap your dictionary in an object dtype array in order to save it.

In [6]: np.save('test1',b)
In [7]: d=np.load('test1.npy')
In [8]: d
Out[8]: 
array({0: [array([0, 0, 0, 0])], 1: [array([1, 0, 0, 0]), array([0, 1, 0, 0]), .... array([ 1, -1,  0,  0]), array([ 1,  0, -1,  0]), array([ 1,  0,  0, -1])]},
      dtype=object)
In [9]: d.shape
Out[9]: ()
In [11]: list(d[()].keys())
Out[11]: [0, 1, 2, 3, 4, 5, 6, 7, 8]

Some timings:

In [12]: timeit np.save('test1',b)
850 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: timeit d=np.load('test1.npy')
566 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: %%timeit 
    ...: with open('testpickle', 'wb') as myfile:
    ...:     pickle.dump(b, myfile)
    ...:     
505 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [21]: %%timeit 
    ...: with open('testpickle', 'rb') as myfile:
    ...:     g1 = pickle.load(myfile)
    ...:     
152 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my timings pickle is faster.
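Outside IPython, roughly the same measurement can be sketched with the standard-library timeit module. The dict `b` below is a small stand-in rather than the question's data, `best_us` is a hypothetical helper, and the files go to a temporary directory; absolute numbers depend on the machine and on the Python and NumPy versions.

```python
import os
import pickle
import tempfile
import timeit

import numpy as np

# Stand-in data (assumption): 9 keys, each mapping to a list of small arrays
b = {k: [np.array([k, 0, 0, 0]) for _ in range(8)] for k in range(9)}

tmpdir = tempfile.mkdtemp()
pickle_path = os.path.join(tmpdir, "testpickle")
numpy_path = os.path.join(tmpdir, "test1.npy")

def save_pickle():
    with open(pickle_path, "wb") as f:
        pickle.dump(b, f)

def save_numpy():
    np.save(numpy_path, b, allow_pickle=True)

def load_pickle():
    with open(pickle_path, "rb") as f:
        return pickle.load(f)

def load_numpy():
    return np.load(numpy_path, allow_pickle=True)

def best_us(func, number=100, repeat=5):
    """Best per-call time in microseconds over several repeated runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e6

timings = {}
# Saves are timed before loads so that both files exist when loading.
for func in (save_pickle, save_numpy, load_pickle, load_numpy):
    timings[func.__name__] = best_us(func)
    print(f"{func.__name__}: {timings[func.__name__]:.1f} µs per call")
```

Repeating each call many times and taking the best run smooths out the coarse resolution that a single `time.time()` difference suffers from.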

The pickle file is slightly smaller:

In [23]: ll test1.npy testpickle
-rw-rw-r-- 1 paul 5740 Aug 14 08:40 test1.npy
-rw-rw-r-- 1 paul 4204 Aug 14 08:43 testpickle

score 0 · Answer 3 · answered Aug 14 '18 at 09:18

This is because pickle works on all sorts of Python objects and is written in pure Python, whereas np.save is designed for arrays and saves them in an efficient format.

From the numpy.save documentation, it can actually use pickle behind the scenes. This may limit portability between versions of Python and runs the risk of executing arbitrary code (which is a general risk when unpickling an unknown object).
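The pickle dependency is easy to observe with the `allow_pickle` argument of `np.load`. The sketch below uses an in-memory buffer and a tiny made-up dict instead of the question's data.

```python
import io

import numpy as np

b = {0: [np.array([0, 0, 0, 0])]}

buf = io.BytesIO()
np.save(buf, b)              # the dict is wrapped in an object array and pickled

buf.seek(0)
try:
    np.load(buf, allow_pickle=False)
    refused = False
except ValueError as exc:    # object arrays cannot be read without pickle
    refused = True
    print("refused:", exc)

buf.seek(0)
restored = np.load(buf, allow_pickle=True)[()]   # [()] unwraps the 0-d array

# A plain numeric array needs no pickle at all.
plain_buf = io.BytesIO()
np.save(plain_buf, np.arange(4))
plain_buf.seek(0)
plain = np.load(plain_buf, allow_pickle=False)
print(restored, plain)
```

Loading with `allow_pickle=True` should only be done for files from a trusted source, for the reason given above.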

Useful reference: This answer
