
This is for Python 3.6.

Edited and removed a lot of stuff that turned out to be irrelevant.

I had thought json was faster than pickle and other answers and comments on Stack Overflow make it seem like a lot of other people believe this as well.

Is my test kosher? The disparity is much larger than I expected. I get the same results testing on very large objects.

import json
import pickle
import timeit

file_name = 'foo'
num_tests = 100000

obj = {1: 1}

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle: %f seconds" % result)

command = 'json.dumps(obj)'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)

and the output:

pickle: 0.054130 seconds
json:   0.467168 seconds
user910210
  • What version of python? – Max Mar 27 '17 at 21:26
  • Because `JSON` is a text serialization format designed to be human-readable and portable, while `pickle` is a binary representation designed to be efficient but restricted to Python. I don't know why you would expect `JSON` to be faster. – juanpa.arrivillaga Mar 27 '17 at 21:27
  • [Going the opposite way.](http://stackoverflow.com/questions/18517949/what-is-faster-loading-a-pickled-dictionary-object-or-loading-a-json-file-to) – pingul Mar 27 '17 at 21:29
  • Neither is built for speed -- if you care about fast and compact, consider [msgpack](http://msgpack.org/). – Charles Duffy Mar 27 '17 at 21:33
  • @pingul I'd say that question is pretty outdated. I'd like to see the comparison on Python 3.6, where `pickle == cPickle` and the latest pickling protocols are available (which should produce shorter and faster pickles than were available on Python 2). – juanpa.arrivillaga Mar 27 '17 at 21:34
  • @CharlesDuffy Very true – juanpa.arrivillaga Mar 27 '17 at 21:34
  • @juanpa.arrivillaga Yes, it is a couple of years old. The code is posted there, however, so if anyone wants to give it another spin -- go ahead :) – pingul Mar 27 '17 at 21:37
  • @pingul I've updated the code to test on Python 3 with the highest pickling protocol. I'm getting about a 2x speed-up using a relatively small data structure; I imagine the difference gets more dramatic the larger the object. [Here is the gist](https://gist.github.com/juanarrivillaga/3f07e5b7d2cd932dee1bd799c3bd31cc) – juanpa.arrivillaga Mar 27 '17 at 21:56
  • This isn't a good test. Mostly it's just testing some irrelevant overhead. – Dietrich Epp Mar 27 '17 at 21:59
  • @DietrichEpp sure, but it was meant to reproduce the test in the link. – juanpa.arrivillaga Mar 27 '17 at 22:00
  • @juanpa.arrivillaga: That's not much of a justification… just because a bad test is a copy of another bad test does not make it useful. – Dietrich Epp Mar 27 '17 at 22:35
  • @DietrichEpp Fair enough, but at least it shows that the deserialization overhead being measured is much lower with the highest pickle protocol on Python 3. My biggest point was that the link is outdated. If anyone wants to test a better example, they are free to create some large object, serialize it, then use that code to test it. – juanpa.arrivillaga Mar 27 '17 at 22:38
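To make the protocol point from the comments concrete, here is a small sketch (not the linked gist; timings vary by machine and Python version, and on newer Pythons the default protocol has caught up with the highest one, so the gap may shrink):

```python
import json
import pickle
import timeit

# A modest structure; {1: 1} mostly measures per-call overhead.
obj = {str(i): list(range(10)) for i in range(100)}

t_default = timeit.timeit(lambda: pickle.dumps(obj), number=1000)
t_highest = timeit.timeit(
    lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL), number=1000)
t_json = timeit.timeit(lambda: json.dumps(obj), number=1000)

print("pickle (default protocol): %f seconds" % t_default)
print("pickle (highest protocol): %f seconds" % t_highest)
print("json:                      %f seconds" % t_json)
```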

3 Answers


I have tried several methods based on your code snippet and found that `cPickle` with the `protocol` argument of `dumps` set, i.e. `cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)`, is by far the fastest dump method.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds
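Note that the snippet above targets Python 2 (`cPickle`). A hedged sketch of a Python 3 equivalent might look like the following; it uses a nested list of floats as a stand-in for the NumPy array so it has no third-party dependencies, and on Python 3 `pickle` already uses the C implementation:

```python
import json
import pickle
import random
import timeit

num_tests = 10

# Stand-in for the 240x320x3 NumPy array in the answer.
obj = [[[random.gauss(0.5, 1) for _ in range(3)] for _ in range(320)]
       for _ in range(240)]

benchmarks = {
    'pickle default': lambda: pickle.dumps(obj),
    'pickle highest': lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
    'json':           lambda: json.dumps(obj),
}

for name, fn in benchmarks.items():
    print("%-15s: %f seconds" % (name, timeit.timeit(fn, number=num_tests)))
```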
Ahmed Abobakr
  • For future users: I ran this on Python 3.5 with `msgpack` 0.5.6 and `pickle` (equivalent to `cPickle` in Python 3), and `pickle` was now faster than `msgpack`. – Paul Nov 28 '18 at 17:18
  • I tried a small dict and msgpack was a tiny bit faster, but not by much. So I guess `pickle` is the better choice most of the time if you use Python structures. Another thing to note about msgpack (and json too) is that the structure can actually change on a dumps/loads round trip: for example, a tuple is converted into a list. `pickle` handles this correctly by default. – Andrius Jul 28 '20 at 13:34
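The round-trip caveat in the last comment is easy to demonstrate; a minimal illustration:

```python
import json
import pickle

obj = {'point': (1, 2)}  # a tuple value

via_json = json.loads(json.dumps(obj))
via_pickle = pickle.loads(pickle.dumps(obj))

print(via_json)    # {'point': [1, 2]}  -- the tuple came back as a list
print(via_pickle)  # {'point': (1, 2)}  -- the tuple is preserved
```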

JSON serialises to a human-readable text format; pickle serialises to a binary representation. Nevertheless, pickle is often fairly slow; there are variants like cPickle that are faster. If you want even better serialisation, use msgpack.
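One concrete way to see the text-versus-binary difference is to compare serialised sizes; a small illustration (exact byte counts depend on the data and the pickle protocol used):

```python
import json
import pickle

obj = {str(i): i * 0.5 for i in range(1000)}

json_bytes = json.dumps(obj).encode('utf-8')
pickle_bytes = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

print("json:   %d bytes" % len(json_bytes))
print("pickle: %d bytes" % len(pickle_bytes))
```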

yar

How many times did you run the benchmark? In any case, you need to average out the random delays introduced by thread blocking and the like, which you can do by running the benchmark a sufficiently high number of times. Your input is also too small to dominate the 'boiler-plate' overhead.
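A standard way to damp such noise is `timeit.repeat`, taking the minimum over several runs; a sketch:

```python
import json
import timeit

obj = {str(i): i for i in range(1000)}  # somewhat larger input than {1: 1}

# repeat() runs the whole benchmark several times; the minimum is the
# least-disturbed measurement, as suggested by the timeit documentation.
times = timeit.repeat(lambda: json.dumps(obj), number=10000, repeat=5)
print("best of 5: %f seconds" % min(times))
```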

pranav3688
  • Yes I ran it many more times locally. For this application it doesn't matter how many times the test is run because it's so (relatively) slow and the difference in the two results is an entire order of magnitude. Don't be dense. – user910210 Mar 27 '17 at 21:37
  • @user910210 No need to be rude. – pingul Mar 27 '17 at 22:03