42

I had been using pickle and was very happy with it, until I saw this article: Don't Pickle Your Data

Reading further, it seems like pickle is slow, unsafe, not human-readable, and not language-agnostic.

I’ve switched to saving my data as JSON, but I wanted to know about best practice:

Given all these issues, when would you ever use pickle? What specific situations call for using it?

e h
  • 2
    BTW, there are formats that are *far* more human-readable than JSON, and arguably easier to edit too. Both good old INI files and YAML come to mind. It's certainly better than an opaque binary stream, but human readability isn't a binary thing. –  Feb 13 '14 at 11:21
  • First downside I see for saving objects as JSON: you have to create your serializers, and that takes some time. Plus, your JSON serialization might, in the end, be slower than a simple pickle. Though I agree on the security downside. Another point: why do you want to store an object and let it be editable? Couldn't that be unsafe? – Depado Feb 13 '14 at 12:11
  • 4
    Why use a hammer when you have a screwdriver? Why use a screwdriver when you have a hammer? It's all about choosing the right tool for the job at hand. – bruno desthuilliers Feb 13 '14 at 12:41
  • This is basically the same post as: http://stackoverflow.com/questions/8968884/python-serialization-why-pickle/19360828#19360828. If you are concerned about security, don't rely on pickle or JSON. Use a stronger authentication service -- something with an encryption key. – Mike McKerns Feb 20 '14 at 12:26
  • Given the extra work that pickle has to do (in comparison to e.g. oversimplified formats such as JSON) to make sure references to objects that are already represented are found, it is not slow at all. JSON, by contrast, throws an error on even extremely simple things like `import json; d = [1]; d.append(d); json.dumps(d)` – Anthon Feb 14 '17 at 13:04

5 Answers

33

Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:

  • Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here, you just want to dump the program's state as-is and load it later.
  • Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.
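
To make "invoking arbitrary functions" concrete, here is a minimal sketch. The `Payload` class is hypothetical, and `os.getcwd` is only a harmless stand-in for whatever callable a hostile pickle stream would name.

```python
import os
import pickle


class Payload:
    """Hypothetical object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # Means "to rebuild me, call os.getcwd()". An attacker would pick
        # something far less harmless than os.getcwd.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# Loading runs the callable named in the stream; no Payload comes back.
result = pickle.loads(blob)
print(type(result).__name__, result == os.getcwd())
```

The loader never checks that the callable has anything to do with the original object, which is why unpickling data from an untrusted source amounts to running untrusted code.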

In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.

You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: Pickle is indeed slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. It adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
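
As a rough illustration of that custom code, here is a sketch for one small object. The `Measurement` class and the `to_jsonable` / `from_jsonable` helpers are hypothetical names invented for this example; real projects will have many more such pairs.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Measurement:
    """Hypothetical domain object that does not fit the JSON model as-is."""
    sensor: str
    taken_on: date
    values: tuple


def to_jsonable(m):
    # Custom code, direction 1: object -> JSON-compatible dict.
    return {
        "sensor": m.sensor,
        "taken_on": m.taken_on.isoformat(),  # JSON has no date type
        "values": list(m.values),            # JSON has no tuple type
    }


def from_jsonable(d):
    # Custom code, direction 2: dict -> object.
    return Measurement(
        sensor=d["sensor"],
        taken_on=date.fromisoformat(d["taken_on"]),
        values=tuple(d["values"]),
    )


original = Measurement("probe-1", date(2014, 2, 13), (0.5, 1.25))
text = json.dumps(to_jsonable(original))
restored = from_jsonable(json.loads(text))
print(text)
print(restored == original)
```

Both helpers have to be written, tested, and kept in sync with the class, and both run on every save and load. That is the engineering and run-time cost the paragraph above refers to.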

  • Thanks for a great answer. Good to know what is the right thing™ even if it does not always == good – e h Feb 13 '14 at 13:41
  • In `multiprocessing`, and in spark too. When working with RDDs, spark will serialize your user defined functions (passed to map, flatmap) [using pickle](https://spark.apache.org/docs/1.1.1/api/python/pyspark.serializers.PickleSerializer-class.html) because it can serialize almost any python object. – James Lim Dec 12 '17 at 02:53
6

Pickle has the advantage of convenience -- it can serialize arbitrary object graphs with no extra work, and works on a pretty broad range of Python types. With that said, it would be unusual for me to use Pickle in new code. JSON is just a lot cleaner to work with.
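
A small sketch of what "arbitrary object graphs" means in practice, using only built-in containers. The variable names are made up for this example.

```python
import json
import pickle

shared = ["shared", "list"]
graph = {"a": shared, "b": shared}  # two references to one object
cycle = [1]
cycle.append(cycle)                 # a list that contains itself

# pickle keeps the shape of the graph: shared references stay shared.
copy = pickle.loads(pickle.dumps(graph))
same_object = copy["a"] is copy["b"]

# A self-referencing list survives the round trip too.
cycle_copy = pickle.loads(pickle.dumps(cycle))
points_to_itself = cycle_copy[1] is cycle_copy

# JSON flattens the shared reference into two independent lists...
json_copy = json.loads(json.dumps(graph))
json_same_object = json_copy["a"] is json_copy["b"]

# ...and refuses the cycle outright.
try:
    json.dumps(cycle)
    json_cycle_error = None
except ValueError as exc:
    json_cycle_error = str(exc)

print(same_object, points_to_itself, json_same_object, json_cycle_error)
```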

Sneftel
5

I tried several methods and found that cPickle with the protocol argument set to the highest protocol, cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.

# Python 2 benchmark: cPickle exists only in Python 2 (Python 3 folds it into pickle).
import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

# Number of times each dump call is repeated by timeit.
num_tests = 10

# Test payload: a 240x320x3 float array, shaped like a 320x240 RGB frame.
obj = np.random.normal(0.5, 1, [240, 320, 3])

# Pure-Python pickle with the default protocol (0, ASCII, in Python 2).
command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle         :   %f seconds" % result)

# C implementation, still with the default protocol.
command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle        :   %f seconds" % result)

# C implementation with the highest (binary) protocol.
command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

# JSON cannot encode a NumPy array, so it is converted to nested lists first.
command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json           :   %f seconds" % result)

# MessagePack gets the same nested-list conversion.
command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack        :   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds

So I prefer cPickle with the highest protocol in situations that require real-time performance, such as video streaming from a camera to a server.
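
For readers on Python 3, where cPickle no longer exists and `pickle` uses its C implementation automatically, a comparable measurement could look like the sketch below. It is not the benchmark above: as an assumption made to keep it standard-library only, a plain list of floats stands in for the NumPy array and msgpack is left out, so the timings are not comparable to the numbers reported here.

```python
import json
import pickle
import random
import timeit

random.seed(0)
# Stand-in payload: plain floats instead of the NumPy array used above,
# so the sketch needs only the standard library.
obj = [random.gauss(0.5, 1) for _ in range(50_000)]

candidates = {
    "pickle protocol 0": lambda: pickle.dumps(obj, protocol=0),
    "pickle highest": lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
    "json": lambda: json.dumps(obj),
}

sizes = {}
for name, dump in candidates.items():
    seconds = timeit.timeit(dump, number=5)  # total time for 5 dumps
    sizes[name] = len(dump())                # length of the serialized result
    print("%-18s: %f seconds, size %d" % (name, seconds, sizes[name]))
```

Protocol 0 writes each float as ASCII text, while the binary protocols write a fixed-width encoding, which is one plausible reason the protocol argument matters so much for numeric data.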

Ahmed Abobakr
4

I usually use neither Pickle nor JSON, but MessagePack: it is both safe and fast, and produces small serialized data.

An additional advantage is the ability to exchange data with software written in other languages (which of course is also true of JSON).

wzab
  • JSON's biggest advantage, IMHO, is that it is both concise (unlike XML) and human-readable (unlike MessagePack). I'm not sure that the size saved by MessagePack is significant enough to negate those two benefits. – CadentOrange Feb 13 '14 at 16:41
  • It isn't so much the size savings in MessagePack as that you can encode things that JSON doesn't do well, like binary data. – Joe Apr 28 '14 at 20:42
  • 2
    MessagePack can't serialize `set`s, what a shame – Display Name Sep 06 '15 at 12:59
2

You can find some answers under JSON vs. Pickle security: JSON can only serialize unicode, int, float, NoneType, bool, list and dict. You can't use it if you want to serialize more advanced objects such as class instances. Note that for those kinds of objects, there is no hope of being language-agnostic.
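
A short sketch of that limitation with the standard `json` module. The `Account` class is a hypothetical user-defined type.

```python
import json

# Types outside the JSON model are silently converted on the way out.
data = {"point": (1, 2), 3: "three"}
round_tripped = json.loads(json.dumps(data))
print(round_tripped)  # the tuple is now a list, the int key a string


class Account:
    """Hypothetical user-defined class."""

    def __init__(self, owner):
        self.owner = owner


# Class instances are rejected unless you supply your own encoder.
try:
    json.dumps(Account("alice"))
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```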

Also, using cPickle instead of pickle partially resolves the speed problem.

hivert
  • I thought cPickle was quicker too, then I saw: http://stackoverflow.com/questions/16833124/pickle-faster-than-cpickle-with-numeric-data – e h Feb 13 '14 at 12:38