I have noticed that loading a dictionary of 5000 objects with pickle takes a long time (minutes) -- but loading a JSON file of 5000 entities takes a short time (seconds). I know that in general objects come with some overhead -- and that in OOP the overhead associated with keeping track of such objects is part of the cost of the ease of using them. But why does loading a pickled object take SO long? What is happening under the hood? What are the costs associated with serializing an object as opposed to merely writing its data to a file? Does pickling restore the object to the same location in memory or something (maybe moving other objects out of the way)? If serialization loads slower (at least pickle does), then what is the benefit?

bernie2436

2 Answers

Assuming that you are using the Python 2.7 standard pickle and json modules...

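In Python 2.7, the standard pickle module is implemented in pure Python (cPickle is the C version), while json uses a C extension for its inner loops when one is available. So you're basically comparing a pure-Python deserializer to an optimized C deserializer. Not a fair comparison, even if the serialization formats were identical.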
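If you want to measure this yourself, here is a minimal timing sketch. The `records` dictionary is made-up test data, not the OP's objects, and the script is written for Python 3, where `pickle` picks up its C accelerator automatically; on Python 2.7 you would have to import `cPickle` explicitly to get the C implementation, so the numbers will not match the situation described above.

```python
import json
import pickle
import timeit

# Hypothetical test data: 5000 small entries with string keys,
# so that JSON can round-trip them unchanged.
records = {
    str(i): {"id": i, "name": "item%d" % i, "tags": ["a", "b"]}
    for i in range(5000)
}

json_blob = json.dumps(records)
pickle_blob = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Time 20 loads of each format; absolute numbers depend on the machine.
json_time = timeit.timeit(lambda: json.loads(json_blob), number=20)
pickle_time = timeit.timeit(lambda: pickle.loads(pickle_blob), number=20)

print("json.loads:   %.4f s" % json_time)
print("pickle.loads: %.4f s" % pickle_time)
```

Plain dicts of strings and ints are the friendliest case for both formats; a dictionary of custom class instances may behave quite differently.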

Dan Lenski
  • I have the same question as OP, comparing loading a text file line by line (into small objects, one per line) vs loading the pickle of the list of objects. I use cPickle, binary format and highest protocol, and unpickling is 50% slower than reading and rebuilding. – Nikana Reklawyks Jul 31 '16 at 09:21
There are speed comparisons out there for the serialization of particular objects, comparing JSON, pickle, and cPickle. Speed varies by object and by format. JSON is usually faster than pickle, and you often hear that you shouldn't use pickle because it's insecure. The reason for the security concerns, and for some of the speed lag, is that pickle doesn't serialize only data -- it serializes some data plus a bunch of instructions, and those instructions are used to assemble the Python objects on load. If you've ever looked at the bytecode that the dis module prints, pickle's instructions are similar in spirit; pickletools.dis shows the ones used in a given pickle. cPickle is, like json, not pure Python, and leverages optimized C, so it's often faster than pickle.
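To see the "data plus instructions" point for yourself, the standard-library `pickletools` module can disassemble a pickle. This is a small sketch; the dictionary being pickled is an arbitrary example, and protocol 2 is chosen only to keep the listing short.

```python
import pickle
import pickletools

# An arbitrary small object, pickled with protocol 2.
payload = pickle.dumps({"name": "spam", "count": 3}, protocol=2)

# Print the instruction stream that the unpickler will execute.
pickletools.dis(payload)

# The same stream as a list of opcode names.
opcode_names = [op.name for op, arg, pos in pickletools.genops(payload)]
print(opcode_names)
```

Loading a pickle means running these opcodes on a small stack machine, which is also why unpickling data from an untrusted source is unsafe: some opcodes import and call arbitrary callables.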

A pickle should, in general, take up less space than the object itself, although some instruction sets can be quite large. JSON tends to be smaller and is human-readable. However, since json stores everything as human-readable text built from a handful of basic types, it can't serialize as many different kinds of objects as pickle and cPickle can. So the trade-off is json for "security" (or inflexibility, depending on your perspective) and human-readability versus pickle for the broader range of objects it can serialize.
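A quick sketch of that trade-off, using made-up sample data: the first part just prints the serialized sizes so you can compare them for your own objects, and the second shows a value that json refuses but pickle round-trips.

```python
import json
import pickle

# Plain data that both formats can handle; compare the sizes yourself.
plain = {"ids": list(range(100)), "label": "example"}
json_size = len(json.dumps(plain).encode("utf-8"))
pickle_size = len(pickle.dumps(plain, protocol=pickle.HIGHEST_PROTOCOL))
print("json bytes:  ", json_size)
print("pickle bytes:", pickle_size)

# Data with types that have no JSON equivalent (a set and a complex number).
rich = {"tags": {"a", "b"}, "offset": complex(1, 2)}
try:
    json.dumps(rich)
    json_ok = True
except TypeError as exc:
    json_ok = False
    print("json failed:", exc)

# pickle handles the same object without extra work.
restored = pickle.loads(pickle.dumps(rich))
print("pickle round-trip equal:", restored == rich)
```

Which format comes out smaller depends on the data and on the pickle protocol, so treat the size printout as something to check rather than a rule.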

Another good reason for choosing pickle (over json) is that you can easily extend pickle, meaning that you can register a new method to serialize an object that pickle doesn't know how to pickle. Python gives you several ways to do that: the __getstate__ and __setstate__ methods, as well as the copy_reg module (renamed copyreg in Python 3). Using these hooks, people have extended pickle to serialize most Python objects; dill is one example.
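Here is a rough sketch of both extension hooks, written for Python 3 (so the module is `copyreg` rather than `copy_reg`). The `Counter` and `Point` classes and the `reduce_point` function are hypothetical names invented for the example.

```python
import copyreg
import pickle
import threading


class Counter:
    """Holds a lock, which pickle cannot serialize on its own."""

    def __init__(self, value=0):
        self.value = value
        self.lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def reduce_point(point):
    # Tell pickle to rebuild a Point by calling Point(x, y).
    return Point, (point.x, point.y)


# Register the reducer without touching the Point class itself.
copyreg.pickle(Point, reduce_point)

counter_copy = pickle.loads(pickle.dumps(Counter(5)))
point_copy = pickle.loads(pickle.dumps(Point(1, 2)))
print(counter_copy.value, point_copy.x, point_copy.y)
```

`__getstate__`/`__setstate__` fit when you own the class; the `copyreg` registration is the option when the class comes from code you can't modify. `Point` would pickle fine without the registration, so it only serves to show the mechanism.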

Pickling doesn't restore the objects to the same memory location. However, it does reconstitute the object to the same state (generally) as when it was pickled. If you want to see some reasons why people pickle, take a look here:

Python serialization - Why pickle?

http://nbviewer.ipython.org/gist/minrk/5241793

http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
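On the memory-location question, a short sketch (with arbitrary example data) shows what "same state, different object" means in practice: the unpickled copy compares equal to the original but is a new object, while references shared inside one pickle stay shared in the copy.

```python
import pickle

shared = [1, 2, 3]
original = {"a": shared, "b": shared}

restored = pickle.loads(pickle.dumps(original))

print(restored == original)            # same state
print(restored is original)            # not the same object
print(restored["a"] is shared)         # the inner list is a new object too
print(restored["a"] is restored["b"])  # sharing inside the pickle is kept
```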

Mike McKerns