
I am calculating some very large numbers using Python, and I'd like to store previously calculated results in Berkeley DB.

The problem is that Berkeley DB stores only strings, while my calculation results are integer tuples.

For example, if I get (m, n) as my result, one way is to store it as "%d,%d" % (m, n) and parse it back out (say, with re). I could also serialize the tuple using pickle or marshal.
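To make the comparison concrete, here is a minimal sketch of the three round-trips I have in mind (the values of m and n below are made up):

import marshal
import pickle

m, n = 2**61 - 1, 42  # made-up example values

# Option 1: string formatting, parsed back by splitting on the comma
s = "%d,%d" % (m, n)
assert tuple(int(x) for x in s.split(",")) == (m, n)

# Option 2: pickle round-trip (bytes)
assert pickle.loads(pickle.dumps((m, n))) == (m, n)

# Option 3: marshal round-trip (bytes, CPython-specific format)
assert marshal.loads(marshal.dumps((m, n))) == (m, n)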

Which has the better performance?

Tianyang Li
  • Why would you use `re` to parse that? Why are you concerned about performance? If you're concerned about performance, why are you expecting interpreting the saved data to be the bottleneck? What is the nature of your "previously calculated results"? Why wouldn't you store a tuple with, you know, multiple columns? Since when do databases limit you to strings only? None of this is making any sense. – Karl Knechtel Mar 12 '12 at 06:47
  • @KarlKnechtel: Berkeley DB does not have columns. It is a key-value database, one of many: Tokyo / Kyoto Cabinet, Memcached, Cassandra, Dynamo, Voldemort are other examples. – Dietrich Epp Mar 12 '12 at 06:52
  • @KarlKnechtel I'm using Berkeley DB, so I don't have multiple columns; if I were using another database, I wouldn't worry about this. See http://stackoverflow.com/questions/2399643/expressing-multiple-columns-in-berkeley-db-in-python – Tianyang Li Mar 12 '12 at 06:54

5 Answers


For pure speed, marshal will get you the fastest results.

Timings:

>>> timeit.timeit("pickle.dumps([1,2,3])","import pickle",number=10000)
0.2939901351928711
>>> timeit.timeit("json.dumps([1,2,3])","import json",number=10000)
0.09756112098693848
>>> timeit.timeit("pickle.dumps([1,2,3])","import cPickle as pickle",number=10000)
0.031056880950927734
>>> timeit.timeit("marshal.dumps([1,2,3])","import marshal", number=10000)
0.00703883171081543
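As a side note (not part of the timings above), whatever you store in Berkeley DB also has to decode on the way back out; a minimal round-trip sketch with marshal, using a made-up (m, n) value:

import marshal

record = (354224848179261915075, 100)  # hypothetical (m, n) result
blob = marshal.dumps(record)           # bytes, suitable as a Berkeley DB value
assert marshal.loads(blob) == record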
Amber
  • It also turns out that if I don't need it to be human-readable, marshal is faster. – Tianyang Li Mar 12 '12 at 07:04
  • I tested marshal against msgpack, but marshal won in terms of speed: marshal's average time for 15000 operations on a small list was 0.0003171195348103841 s, versus 0.0008052133083343506 s for msgpack on the same test. I did not check space usage, though... – Urjit Mar 15 '12 at 05:07
  • 1
    Keep in mind this warning from marshal docs: http://docs.python.org/library/marshal.html Warning The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source. – Urjit Mar 15 '12 at 06:03
  • @Urjit The same is said for pickle. That wouldn't be a reason to pick one over the other. – Hielke Walinga Mar 26 '21 at 14:48

When thinking about performance, remember three things:

  • Don't trust anybody - any benchmark can lie (for various reasons: sloppy methodology, marketing, etc.)
  • Always measure your own case - for example, a cache system and a statistics system have totally different requirements: one needs to read as fast as possible, the other to write (see the benchmark sketch after this list)
  • Repeat your tests - a new version of any piece of software can be faster or slower, so any update can introduce benefits or penalties
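The original serializators.py is not shown, so here is a rough sketch of what such a harness might look like (the names, payload, and msgpack dependency are assumptions):

import timeit
import pickle, json, marshal

try:
    import msgpack  # third-party: pip install msgpack
except ImportError:
    msgpack = None

iterations = 100000
data = [1, 2, 3] * 100

# (dump, load) pairs; json is wrapped so every serializer produces bytes
serializers = {
    "Pickle": (pickle.dumps, pickle.loads),
    "Json": (lambda o: json.dumps(o).encode(), lambda b: json.loads(b.decode())),
    "Marshal": (marshal.dumps, marshal.loads),
}
if msgpack is not None:
    serializers["Msgpack"] = (msgpack.packb, msgpack.unpackb)

print("==== DUMP ====")
for name, (dump, _) in serializers.items():
    print(name, ">>", timeit.timeit(lambda: dump(data), number=iterations))

print("==== LOAD ====")
for name, (dump, load) in serializers.items():
    blob = dump(data)
    print(name, ">>", timeit.timeit(lambda: load(blob), number=iterations))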

For example, here are the results of my benchmark:

jimilian$ python3.5 serializators.py
iterations= 100000
data= 'avzvasdklfjhaskldjfhkweljrqlkjb*@&$Y)(!#&$G@#lkjabfsdflb(*!G@#$(GKLJBmnz,bv(PGDFLKJ'
==== DUMP ====
Pickle:
>> 0.09806302400829736
Json: 2.0.9
>> 0.12253901800431777
Marshal: 4
>> 0.09477431800041813
Msgpack: (0, 4, 7)
>> 0.16701826300413813

==== LOAD ====
Pickle:
>> 0.10376790800364688
Json: 2.0.9
>> 0.30041573599737603
Marshal: 4
>> 0.034003349996055476
Msgpack: (0, 4, 7)
>> 0.061493027009419166

jimilian$ python3.5 serializators.py
iterations= 100000
data= [1,2,3]*100
==== DUMP ====
Pickle:
>> 0.9678693519963417
Json: 2.0.9
>> 4.494351467001252
Marshal: 4
>> 0.8597690019960282
Msgpack: (0, 4, 7)
>> 1.2778299400088144

==== LOAD ====
Pickle:
>> 1.0350999219954247
Json: 2.0.9
>> 3.349724347004667
Marshal: 4
>> 0.468191737003508
Msgpack: (0, 4, 7)
>> 0.3629750510008307

jimilian$ python2.7 serializators.py
iterations= 100000
data= [1,2,3]*100
==== DUMP ====
Pickle:
>> 50.5894570351
Json: 2.0.9
>> 2.69190311432
cPickle: 1.71
>> 5.14689707756
Marshal: 2
>> 0.539206981659
Msgpack: (0, 4, 7)
>> 0.752672195435

==== LOAD ====
Pickle:
>> 58.8052768707
Json: 2.0.9
>> 3.50090789795
cPickle: 1.71
>> 8.46298909187
Marshal: 2
>> 0.469168901443
Msgpack: (0, 4, 7)
>> 0.315001010895

So, as you can see, sometimes it's better to use Pickle (Python 3, long string, dump), sometimes msgpack (Python 3, long array, load), and in Python 2 things work completely differently. That's why nobody can give a definitive answer that will be valid for everybody.

Jimilian

Time them and find out!

I'd expect cPickle to be the fastest, but that's no guarantee.
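One portable idiom (a common pattern, not something spelled out in this answer) is to try cPickle and fall back to pickle, which works on both Python 2 and Python 3:

try:
    import cPickle as pickle   # Python 2: C-accelerated implementation
except ImportError:
    import pickle              # Python 3: the C implementation is used automatically

import timeit
print(timeit.timeit(lambda: pickle.dumps((10**100, 42)), number=10000))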

torek
  • 1
    Note that the OP doesn't mention a Python version, and `cPickle` doesn't exist separately from `pickle` in Py3 - `pickle` will provide the optimised version of it exists, and fall back to the pure-python version otherwise. – lvc Mar 12 '12 at 06:58

Check out shelve, a simple persistent key-value store with a dictionary-like API that uses pickle to serialize objects.
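A minimal sketch of what that could look like for caching (m, n) results (the filename and key scheme here are made up):

import shelve

# Keys must be strings; values can be any picklable object,
# so an integer tuple is stored directly with no manual encoding.
with shelve.open("results_cache") as db:
    db["input-123"] = (354224848179261915075, 100)  # hypothetical result
    m, n = db["input-123"]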

kindall

In Python 3.8 the results of the speed comparison may differ from what was shown in the accepted answer.

Python 3.8.10 (default, May  4 2021, 00:00:00) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import timeit
>>> 
>>> timeit.timeit("pickle.dumps([1,2,3])","import pickle",number=10000)
0.005186535003304016
>>> timeit.timeit("json.dumps([1,2,3])","import json",number=10000)
0.03863359600654803
>>> timeit.timeit("marshal.dumps([1,2,3])","import marshal", number=10000)
0.00884882499667583
>>> 

It seems that pickle is now a little bit faster than marshal.

Amir reza Riahi