12

We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.

The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.

Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume switching to another serialization format such as JSON wouldn't help here.

I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?

Ben Hoyt
  • This might be your chance to get a solid-state drive. Is this to speed up dev? To allow you to do quick deployments? Is the lag in reading the data or unpickling it? If you start with an empty instance, what is the startup time? – Scott Nov 16 '10 at 14:59
  • Note that I mention in my question the bottleneck is not drive/read speed, but unpickling and object creation speed. It's more for quick deployments -- to allow our server to restart quickly. I'm not quite sure what you mean by "empty instance" here. – Ben Hoyt Nov 16 '10 at 15:08
  • For a 750MB pickle binary file, wrapping the cPickle load call with gc.disable() / gc.enable() drastically reduced the total time required by around 20x. See [here](http://stackoverflow.com/a/36699998/2385420) – Tejas Shah Apr 18 '16 at 17:42
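A minimal sketch of the gc.disable() wrapping mentioned in the last comment, assuming a single pickle per file ('data.pkl' is a placeholder path; untested on the asker's data):

```python
import cPickle as pickle
import gc

def load_pickle_fast(path):
    # Unpickling creates millions of objects; pausing the cyclic
    # garbage collector avoids repeated collection passes over
    # objects that are not garbage anyway.
    gc.disable()
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    finally:
        gc.enable()  # always re-enable, even if loading fails

data = load_pickle_fast('data.pkl')
```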

6 Answers

17
  1. Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary class instances like pickle does, only built-in types (I don't remember the exact constraints; see the docs). Also note that the format isn't stable across Python versions. (There's a sketch after this list.)

  2. If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]

  3. If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.

  4. If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just incur lots of overhead for little gain. (I've never used one, so this might be nonsense.)

  5. Maybe other python implementations can do it? Don't know, just a thought...
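A minimal sketch of option 1, assuming the data consists only of built-in types; the toy data and file name are placeholders, and this is untested against the asker's workload:

```python
import marshal
import time

# Toy stand-in for the real data: marshal handles built-in types
# (dicts, lists, strings, ints, ...) but not class instances.
data = dict(('key%d' % i, range(10)) for i in xrange(100000))

with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)  # version-specific format; not for long-term storage

start = time.time()
with open('data.marshal', 'rb') as f:
    loaded = marshal.load(f)
print 'marshal.load took %.2fs' % (time.time() - start)
```

For option 3, numpy's np.load(filename, mmap_mode='r') is one way to get a memory-mapped array, so the raw data is paged in on demand rather than deserialized up front.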

Beni Cherniavsky-Paskin
  • Thanks, helpful stuff. FYI, in my quick test on a large file with lots of objects, marshal.loads() was about twice as fast as pickle.loads(). – Ben Hoyt Nov 16 '10 at 16:37
  • Same experience here on a huge dictionary; marshal.load takes only 0.78s, where cPickle.load takes 1.2s. – unhammer Aug 18 '11 at 12:16
  • Option number two would be problematic, as objects will be copied for each forked sub-process the moment you reference them. This is because each object has a ref count which changes every time you access the object. This in turn is like changing the object and thus causes the memory to be copied. Essentially, copy-on-write becomes copy-on-access where Python is concerned... – FableBlaze Jan 04 '13 at 23:23
  • @anti666: Indeed, though for "wide" objects - huge lists or dicts - the bulk of the memory remains sharable, just the headers are inc/decref'd. If the child is going to access a big portion of the data, page faults will cost much more than is saved. Assuming it only needs small portions of the data _at once_, the best approaches are things like numpy / pytables / OO DBs that support random access but materialize python objects on demand. Heck, even `shelve` might be good! [Disclaimer: I never timed forking like this, just speculating...] – Beni Cherniavsky-Paskin Jan 16 '13 at 01:31
7

Are you load()ing the pickled data directly from the file? What about loading the file into memory first and then doing the load? I would start by trying cStringIO(); alternatively, you could write your own version of StringIO that uses buffer() to slice the memory, which would reduce the copy() operations needed (cStringIO may still be faster, but you'll have to try).

There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; Windows is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does lots of small reads, or if you are calling load() several times to read the data, this would help.
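A minimal sketch of both variants, with a placeholder file name; which one wins would need measuring:

```python
import cPickle as pickle
import cStringIO

# Variant 1: read the whole file in one go, then unpickle from a
# string; this avoids many small read() calls on the file object.
with open('data.pkl', 'rb') as f:
    data = pickle.loads(f.read())

# Variant 2: wrap the bytes in cStringIO; handy if the file holds
# several consecutive pickles, since load() can be called repeatedly.
with open('data.pkl', 'rb') as f:
    buf = cStringIO.StringIO(f.read())
data = pickle.load(buf)
```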

ondra
  • To the person who gave me the -1: try loading a file by calling read(1) on Windows; then try to do it on Unix. It takes several seconds to read a few megabytes on Windows; it's still instantaneous on Unix. If benhoyt is loading a lot of objects by making several tens of thousands of pickle.load() calls on a file, this could be a factor. – ondra Nov 16 '10 at 16:05
  • Good call. On our data, changing "obj = pickle.load(f)" to "s = f.read(); obj = pickle.loads(s)" gives a speed increase of 30%. Not orders of magnitude, but worth knowing about. (BTW, I accidentally pressed down instead of up; feel free to make a minor edit to your answer so I can upvote it.) – Ben Hoyt Nov 16 '10 at 16:10
  • I found there to be a pretty substantial improvement to do the same process with the marshal module (which makes sense since it is a Windows problem). – Justin Peel Nov 16 '10 at 18:47
4

I haven't used cPickle (or Python), but in cases like this I think the best strategy is to avoid unnecessary loading of the objects until they are really needed, say by loading them after startup on a different thread. It's usually better to avoid unnecessary loading/initialization at any time, for obvious reasons; Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before the server starts up, then maybe you can implement a manual, custom deserialization method: in other words, implement something yourself, using your intimate knowledge of the data, which can help you 'squeeze' better performance out of it than the general tool can. (A sketch of the lazy-loading idea follows.)
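A minimal sketch of lazy loading, with a hypothetical LazyData wrapper and placeholder file name (the double-checked lock just avoids two threads unpickling at once):

```python
import cPickle as pickle
import threading

class LazyData(object):
    """Defers unpickling until the data is first accessed."""

    def __init__(self, path):
        self._path = path
        self._data = None
        self._lock = threading.Lock()

    @property
    def data(self):
        if self._data is None:
            with self._lock:
                if self._data is None:  # re-check under the lock
                    with open(self._path, 'rb') as f:
                        self._data = pickle.load(f)
        return self._data

big_data = LazyData('data.pkl')  # server starts instantly; the
# expensive load happens on first access: big_data.data
```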

ivo s
3

Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
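For what it's worth, a small sketch comparing the two when writing the pickle (toy data and placeholder file names):

```python
import cPickle as pickle

data = {'numbers': range(1000000)}

# Protocol 0, the Python 2 default, is ASCII-based.
with open('data_p0.pkl', 'wb') as f:
    pickle.dump(data, f, 0)

# HIGHEST_PROTOCOL (2 on Python 2.6) is binary; per the asker's
# comment below, it loaded about twice as fast as the default.
with open('data_p2.pkl', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
```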

Satwik
  • Good thought. However, we were using the default (lowest) protocol at first, but switching to HIGHEST_PROTOCOL (a binary-based protocol) sped it up by a factor of two. So HIGHEST_PROTOCOL is definitely faster. – Ben Hoyt Nov 16 '10 at 16:00
2

Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.

If it is some sort of business logic, maybe you should try turning it into a pre-compiled module.

If it is structured data, can you delegate it to a database and only pull what is needed?

Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?

Hugh Bothwell
2

I'll add another answer that might be helpful - if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting and sometimes impossible; however, it seems to have cut the initialization time in my test roughly in half. (A sketch follows.)
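A minimal sketch of the idea, with a hypothetical Record class standing in for whatever class dominates the pickled data:

```python
class Record(object):
    # __slots__ replaces the per-instance __dict__ with fixed
    # attribute slots, so creating millions of instances allocates
    # less memory and is measurably faster.
    __slots__ = ('name', 'value', 'timestamp')

    def __init__(self, name, value, timestamp):
        self.name = name
        self.value = value
        self.timestamp = timestamp
```

One caveat: instances of classes that define __slots__ (and have no __dict__) can only be pickled with protocol 2, which the asker is already using via HIGHEST_PROTOCOL.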

ondra