46

I have created a dictionary in Python and dumped it into a pickle file. Its size came out to around 300 MB. Now I want to load the same pickle.

import pickle

output = open('myfile.pkl', 'rb')
mydict = pickle.load(output)
output.close()

Loading this pickle takes around 15 seconds. How can I reduce this time?

Hardware specification: Ubuntu 14.04, 4 GB RAM

The code below shows how long it takes to dump and load a file using json, pickle, and cPickle.

After dumping, the file size is around 300 MB.

import json, pickle, cPickle
import timeit

mydict = {...}  # all the values to be added

def dump_json():
    output = open('myfile1.json', 'w')  # json is text-based, so use text mode
    json.dump(mydict, output)
    output.close()

def dump_pickle():
    output = open('myfile2.pkl', 'wb')
    pickle.dump(mydict, output, protocol=pickle.HIGHEST_PROTOCOL)
    output.close()

def dump_cpickle():
    output = open('myfile3.pkl', 'wb')
    cPickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def load_json():
    output = open('myfile1.json', 'r')
    mydict = json.load(output)
    output.close()

def load_pickle():
    output = open('myfile2.pkl', 'rb')
    mydict = pickle.load(output)
    output.close()

def load_cpickle():
    output = open('myfile3.pkl', 'rb')
    mydict = cPickle.load(output)  # note: cPickle.load, not pickle.load
    output.close()


if __name__ == '__main__':
    # NOTE: this script imports itself, so it must be saved as pickle_wr.py
    print "Json dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")
    print t.timeit(1),'\n'

    print "Pickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "cPickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "Json load: "
    t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "pickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "cPickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

Output:

Json dump: 
42.5809804916 

Pickle dump: 
52.87407804489 

cPickle dump: 
1.1903790187836 

Json load: 
12.240660209656 

pickle load: 
24.48748306274 

cPickle load: 
24.4888298893

I can see that cPickle takes much less time to dump, but loading the file still takes a long time. How can I reduce it?

Mel
iNikkz

  • Are you using [`cPickle`](https://docs.python.org/2/library/pickle.html#module-cPickle)? If not, please try it. You can just use it as a drop-in replacement. – Carsten Nov 11 '14 at 07:57
  • @Carsten: thanks. I heard that cPickle is faster than pickle, but it doesn't reduce the time as much as I need. – iNikkz Nov 11 '14 at 08:10
  • Can you add the code to create the dictionary as well? – Phani Nov 11 '14 at 09:17
  • When you dump the dictionary, try adding an optional protocol argument of [`pickle.HIGHEST_PROTOCOL`](https://docs.python.org/2/library/pickle.html#usage) (or `-1`). This will use a more compact binary-mode data format than the default ASCII-based one. – martineau Nov 11 '14 at 09:25
  • I have edited my question and appended the code. Please check. Also used pickle.HIGHEST_PROTOCOL. – iNikkz Nov 11 '14 at 10:39
  • @iNikkz by the way, if the answer is helpful, please accept it by clicking on the green checkbox. – twasbrillig Nov 29 '14 at 10:45
  • @iNikkz another workaround is to wrap the pickle load calls as mentioned [here](http://stackoverflow.com/a/9270029/2385420). It drastically improves the loading performance. – Tejas Shah Jan 19 '17 at 04:37
  • @iNikkz i'll add the snippet and performance benchmark in the answer below. Have a look – Tejas Shah Jan 19 '17 at 04:39
  • Don't you have to open the file handles used by json in non-binary mode, since json.dump is text-based? – katsumi Aug 14 '18 at 15:08
  • You are limited by your I/O speed. Please benchmark without writing to disk if you want to benchmark serialization/deserialization tools; writing adds an offset to all the measurements. So yes, a file can be "long to load", but not due to pickle/json/etc. issues! – ZettaCircl May 17 '19 at 08:57
  • In load_cpickle() you should call cPickle.load(output) instead of pickle.load(output)! – Andreas Abel Apr 23 '20 at 20:36

3 Answers

33

Try using the json library instead of pickle. This can be an option in your case because you're dealing with a dictionary, which is a relatively simple object. (Note, as pointed out in the comments, that json can only serialize JSON-compatible values; bytes values, for instance, will not work.)

According to this website,

JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?
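
For example, here is a minimal sketch of the json round trip (assuming a hypothetical dictionary whose keys are strings and whose values are JSON-serializable; as noted above, bytes values will not work):

import json

mydict = {'key1': 1, 'key2': [1.5, 2.5], 'key3': 'value'}  # hypothetical data

# json is a text format, so open the file in text mode
with open('myfile1.json', 'w') as f:
    json.dump(mydict, f)

with open('myfile1.json', 'r') as f:
    mydict = json.load(f)

Keep in mind that json stringifies non-string keys and turns tuples into lists, so the round trip is not always lossless.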

Upgrading Python or using the marshal module with a fixed Python version also helps boost speed (code adapted from here):

try:
    import cPickle
except ImportError:
    import pickle as cPickle  # Python 3: cPickle was merged into pickle
import pickle
import json, marshal, random
from time import time
from hashlib import md5

test_runs = 1000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle, cPickle, marshal]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            if module.__name__ in ['pickle', 'cPickle']:
                for i in range(test_runs): serialized = module.dumps(data, protocol=-1)
            else:
                for i in range(test_runs): serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))

Results:

C:\Python27\python.exe -u "serialization_benchmark.py"
json int W 0.125 R 0.156
pickle int W 2.808 R 1.139
cPickle int W 0.047 R 0.046
marshal int W 0.016 R 0.031
json float W 1.981 R 0.624
pickle float W 2.607 R 1.092
cPickle float W 0.063 R 0.062
marshal float W 0.047 R 0.031
json str W 0.172 R 0.437
pickle str W 5.149 R 2.309
cPickle str W 0.281 R 0.156
marshal str W 0.109 R 0.047

C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
json int W 0.515 R 0.452
pickle int W 0.546 R 0.219
cPickle int W 0.577 R 0.171
marshal int W 0.032 R 0.031
json float W 2.390 R 1.341
pickle float W 0.656 R 0.436
cPickle float W 0.593 R 0.406
marshal float W 0.327 R 0.203
json str W 1.141 R 1.186
pickle str W 0.702 R 0.546
cPickle str W 0.828 R 0.562
marshal str W 0.265 R 0.078

c:\Python34\python -u "serialization_benchmark.py"
json int W 0.203 R 0.140
pickle int W 0.047 R 0.062
pickle int W 0.031 R 0.062
marshal int W 0.031 R 0.047
json float W 1.935 R 0.749
pickle float W 0.047 R 0.062
pickle float W 0.047 R 0.062
marshal float W 0.047 R 0.047
json str W 0.281 R 0.187
pickle str W 0.125 R 0.140
pickle str W 0.125 R 0.140
marshal str W 0.094 R 0.078

Python 3.4 uses pickle protocol 3 by default, which showed no difference compared to protocol 4. Python 2 has protocol 2 as its highest pickle protocol (selected when a negative value is passed to dump), which is twice as slow as protocol 3. (In the Python 3.4 run above, cPickle is simply aliased to pickle by the import fallback, which is why the pickle rows appear twice.)
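
As a quick sanity check on your own interpreter, a sketch like this (with a hypothetical payload) prints the highest protocol available and how compactly each protocol serializes the same data:

import pickle

print(pickle.HIGHEST_PROTOCOL)  # 2 on Python 2.7, 4 on Python 3.4+

data = {i: float(i) for i in range(1000)}  # hypothetical payload
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(data, protocol=proto))
    print("protocol %d -> %d bytes" % (proto, size))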

twasbrillig
  • @twasbrillig: Pretty cool. Json is faster than pickle and cPickle, but loading a json file is still a time-consuming process. Could you please check my updated question and suggest some ideas? – iNikkz Nov 11 '14 at 10:42
  • @Nikkz Using a newer Python and/or a linked third-party module might be even faster than `marshal`. – Cees Timmerman Nov 11 '14 at 16:08
  • @twasbrillig: Great, thanks. You tried with **int, float, str** over a range of only **0 to 1000**, which is **small**. What if `n` goes to **10000000**? The **time** will increase, and that is what I don't want. – iNikkz Nov 12 '14 at 06:17
  • @twasbrillig I don't wish to downgrade, but when running `pip` from the `Scripts` dir, I run into http://stackoverflow.com/questions/2817869/error-unable-to-find-vcvarsall-bat @Nikkz The relative time should be the same. For 10 million 30-byte plaintext strings, [use compression](http://stackoverflow.com/a/18475192/819417) to offload the processing burden from the slow storage device to the fast CPU. – Cees Timmerman Nov 12 '14 at 10:29
  • @CeesTimmerman I tried installing `ujson` with `pip` and got that error too. But there are Windows binaries here http://www.lfd.uci.edu/~gohlke/pythonlibs/#ujson and I installed the 64-bit versions for Python 2.7 and 3.4 and both worked for me! – twasbrillig Nov 12 '14 at 10:47
  • @twasbrillig Thanks. In 32-bit Python 3.4 on my 64-bit machine, `marshal` is 2 to 3 times faster than `ujson`, and produces up to 50% smaller output. – Cees Timmerman Nov 12 '14 at 11:14
  • Cool, sounds like we have a winner! – twasbrillig Nov 12 '14 at 11:17
  • I tested `zlib` and `bz2` compression [here](https://gist.github.com/CTimmerman/1f328f02ac2740f4c90d). `zlib` default level 6 is roughly twice as small but 5 times as slow to load, though I only used RAM. – Cees Timmerman Nov 12 '14 at 14:02
  • JSON will not work if you have any bytes values in your dictionary, so this post makes a huge assumption. Not everything is JSON-serializable! – Tommy Feb 01 '18 at 19:30
  • @CeesTimmerman This answer appears to support using `json` at the beginning, but the Python 3 statistics support using `pickle`, which is confusing. I checked the edit log and guess you might want to keep the original answer for Python 2.7. It's 2022 now and `pickle` wins the experiment. I think the answer should be updated. – Rick Jul 25 '22 at 16:06
17

I've had nice results reading huge files (e.g. a ~750 MB igraph object stored as a binary pickle file) using cPickle itself. This was achieved by simply wrapping the pickle load call in gc.disable()/gc.enable(), as mentioned here.

An example snippet for your case would be something like this:

import timeit
import cPickle as pickle
import gc


def load_cpickle_gc():
    output = open('myfile3.pkl', 'rb')

    # disable garbage collector
    gc.disable()

    mydict = pickle.load(output)

    # enable garbage collector again
    gc.enable()
    output.close()


if __name__ == '__main__':
    print "cPickle load (with gc workaround): "
    t = timeit.Timer(stmt="pickle_wr.load_cpickle_gc()", setup="import pickle_wr")
    print t.timeit(1),'\n'

Surely there might be more apt ways to get the task done, but this workaround reduces the time required drastically (for me, it dropped from 843.04 s to 41.28 s, around 20x).
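
As noted in the comments below, cPickle was merged into pickle in Python 3, so a rough Python 3 equivalent of the same workaround (with a try/finally so the collector is re-enabled even if loading fails) might look like this:

import gc
import pickle

def load_pickle_gc(path):
    # disable the garbage collector so it doesn't traverse the millions
    # of objects created while unpickling
    with open(path, 'rb') as f:
        gc.disable()
        try:
            return pickle.load(f)
        finally:
            gc.enable()  # always re-enable the collector

mydict = load_pickle_gc('myfile3.pkl')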

Tejas Shah
  • If I try this, I get the error: TypeError: expected str, bytes or os.PathLike object, not _io.BufferedReader. The pickle was written in "wb" mode. – Varlor Mar 02 '18 at 12:17
  • Can you provide the snippet / try following for pickling the obj?: `with open(filename, 'wb') as output: pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)` – Tejas Shah Mar 03 '18 at 16:45
  • cPickle is sooooo much faster – jsj May 22 '18 at 10:32
  • Thanks very much, this is a very convenient way to speed up my script :) – TabeaKischka Jan 18 '19 at 13:10
  • How/why does disabling GC help, if at all it helps? – Gokul NC Jul 28 '21 at 06:59
  • For people who are still reading this answer: it seems that cPickle has been integrated into pickle in Python 3. https://stackoverflow.com/questions/37132899/installing-cpickle-with-python-3-5#comment105077961_37138791 – Carl H Feb 15 '23 at 06:34
7

If you are trying to store the dictionary in a single file, it's the load time for that one large file that is slowing you down. One of the easiest things you can do is to write the dictionary to a directory on disk, with each dictionary entry stored as an individual file. Then the files can be pickled and unpickled in multiple threads (or using multiprocessing). For a very large dictionary, this should be much faster than reading to and from a single file, regardless of the serializer you choose. There are packages like klepto and joblib that already do much (if not all) of the above for you. I'd check those packages out. (Note: I am the klepto author. See https://github.com/uqfoundation/klepto.)
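
A rough sketch of the klepto.archives.dir_archive interface mentioned in the comments below (the exact keyword arguments are my assumption; see the klepto tests linked there for authoritative usage):

from klepto.archives import dir_archive

# an archive backed by a directory: each key is stored as its own file on disk
db = dir_archive('mydict_dir', cached=True, serialized=True)

db['key1'] = [1, 2, 3]   # hypothetical entries
db['key2'] = 'some value'
db.dump()                # flush the in-memory cache to the directory

# later, or in another process: pull entries back into memory
db2 = dir_archive('mydict_dir', cached=True, serialized=True)
db2.load()               # or db2.load('key1') to load a single entry
print(db2['key1'])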

Mike McKerns
  • Very intriguing answer! I'm in the same boat: I have various serialized (100 to 300 MB) pickle files that I would like to load into a single dictionary, but it takes too much time to load them individually, so I would rather cache. Could you possibly provide or link a very basic example using klepto / joblib to achieve this? – nimig18 Jun 30 '22 at 21:00
  • look at `klepto.archives.dir_archive` or `klepto.archives.hdfdir_archive`. Essentially, both have a dictionary interface that's been extended a bit. – Mike McKerns Jul 01 '22 at 00:46
  • There's some basic functionality demonstrated in this test of `dir_archive`: https://github.com/uqfoundation/klepto/blob/master/tests/test_readwrite.py – Mike McKerns Jul 01 '22 at 00:52