33

I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle's lack of safety isn't a concern).

Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.

For serialization, I've considered pickle (cPickle), json, and plain text - but only pickle saves the type information: json can't serialize datetime.datetime, and plain text has its obvious disadvantages.

However, cPickle is pretty slow for data this large, and I'm looking for a faster alternative.

Guy Adini

8 Answers

15

Pickle is actually quite fast so long as you aren't using the (default) ASCII protocol. Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
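A minimal sketch of the dump/load round trip (the file name and the `rows` variable are illustrative):

import pickle  # on Python 2, use cPickle for speed

# Dump with the highest (binary) pickle protocol instead of the default.
with open('rows.pkl', 'wb') as f:
    pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back; the protocol is detected automatically on read.
with open('rows.pkl', 'rb') as f:
    rows = pickle.load(f)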

Jake Biesinger

  • It should be noted that for `python3` the default format is actually binary, according to the docs: http://docs.python.org/3.4/library/pickle.html?highlight=pickle#pickle – Seanny123 Dec 02 '13 at 12:00
  • A semantically better alternative is `protocol=pickle.HIGHEST_PROTOCOL`. – Martin Thoma Dec 17 '14 at 10:59
  • Thanks, @moose! Updated from `protocol=-1`. – Jake Biesinger Apr 18 '15 at 19:09
  • As in `pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)`. – MrMartin Aug 22 '18 at 13:29
  • The highest protocol keeps changing, and any negative number selects it: data pickled with the highest protocol on a newer Python (protocol 5 in 3.8, versus 4 in 3.7) cannot be deserialized by an older Python whose highest protocol is lower. – nurettin Feb 26 '20 at 10:58
7

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler.

Advantages over XML:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically

https://developers.google.com/protocol-buffers/docs/pythontutorial
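A hedged sketch of how the rows might map onto protobuf. The .proto schema, the generated module name `rows_pb2`, and the field names are all assumptions about the data (`data` stands in for the question's list of tuples), and datetimes need an explicit encoding since protobuf has no native datetime scalar:

# rows.proto, compiled with: protoc --python_out=. rows.proto
#
#   syntax = "proto3";
#   message Row   { string name = 1; int64 count = 2; string created = 3; }
#   message Table { repeated Row rows = 1; }

import rows_pb2  # the protoc-generated module (hypothetical name)

table = rows_pb2.Table()
for name, count, created in data:
    row = table.rows.add()
    row.name = name
    row.count = count
    row.created = created.isoformat()  # encode datetime as an ISO 8601 string

with open('rows.pb', 'wb') as f:
    f.write(table.SerializeToString())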

gustavodiazjaimes
7

Depending on what exactly you want to store, there are other alternatives. The ways to compare them (a small timing sketch follows the list) are:

  • Ease of use / Programming language support / Tooling support
  • Being readable by a human
  • Storage size
  • Read-time
  • Write-time
  • Features: (1) appending data, (2) reading a single row, (3) having a schema
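A minimal sketch of how one might measure the write-time, read-time, and storage-size criteria; pickle and json stand in for whichever candidate formats you are comparing, and the file names and sample data are illustrative:

import json
import os
import pickle
import time

rows = [('alice', 1, None)] * 10**6  # stand-in for the real data

def bench(filename, write, read):
    t0 = time.time()
    write()
    t1 = time.time()
    read()
    t2 = time.time()
    print('%s: write %.2fs, read %.2fs, %d bytes'
          % (filename, t1 - t0, t2 - t1, os.path.getsize(filename)))

def pickle_write():
    with open('rows.pkl', 'wb') as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

def pickle_read():
    with open('rows.pkl', 'rb') as f:
        pickle.load(f)

def json_write():
    # note: json turns tuples into lists and cannot encode datetime
    with open('rows.json', 'w') as f:
        json.dump(rows, f)

def json_read():
    with open('rows.json') as f:
        json.load(f)

bench('rows.pkl', pickle_write, pickle_read)
bench('rows.json', json_write, json_read)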
Martin Thoma
5

I think you should give PyTables a look. It should be ridiculously fast, at least faster than using an RDBMS, since it's very lax and doesn't impose any read/write restrictions, plus you get a better interface for managing your data, at least compared to pickling it.
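A hedged sketch of what this could look like with PyTables; the column names, sizes, and None sentinels are assumptions about the data (`data` stands in for the question's list of tuples):

import tables  # pip install tables

class RowDesc(tables.IsDescription):
    name = tables.StringCol(64)     # fixed-width bytes column
    count = tables.Int64Col()
    created = tables.Float64Col()   # store datetime as a POSIX timestamp

with tables.open_file('rows.h5', mode='w') as h5:
    table = h5.create_table('/', 'rows', RowDesc)
    r = table.row
    for name, count, created in data:
        r['name'] = name.encode('utf-8')
        r['count'] = count if count is not None else -1        # None sentinel
        r['created'] = created.timestamp() if created else 0.0  # None sentinel
        r.append()
    table.flush()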

Filip Dupanović
3

For hundreds of thousands of simple (i.e., JSON-compatible) Python objects, I've found the best combination of simplicity, speed, and size by combining gzip compression with UBJSON (the py-ubjson package):

It beats the pickle and cPickle options by orders of magnitude.

import gzip
import ubjson  # pip install py-ubjson

def dump(items, filename):
    with gzip.open(filename, 'wb') as f:
        ubjson.dump(items, f)

def load(filename):
    with gzip.open(filename, 'rb') as f:
        return ubjson.load(f)
Apalala
3

Just for the sake of completeness: there is also the dill library, which extends pickle.

How to dill (pickle) to file?
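A minimal sketch; dill is a drop-in replacement for pickle's dump/load interface, and the file name is illustrative:

import dill  # pip install dill

# dill speaks pickle's interface but handles more types
# (lambdas, closures, nested functions, ...).
with open('rows.dill', 'wb') as f:
    dill.dump(data, f)

with open('rows.dill', 'rb') as f:
    data = dill.load(f)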

sophros
2

I usually serialize to plain text (*.csv) because I found it to be the fastest. The csv module works quite well; see http://docs.python.org/library/csv.html

If you have to deal with unicode strings on Python 2, check out the UnicodeReader and UnicodeWriter examples at the end of that page.

If you serialize for your own future use, it should suffice to know that you have the same data type per csv column (e.g., strings are always in column 2).
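A minimal Python 3 sketch of the round trip (the file name and `rows` are illustrative; note that every value comes back as a string, which is the type-information loss discussed in the comment below):

import csv

with open('rows.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

with open('rows.csv', newline='') as f:
    rows = [tuple(row) for row in csv.reader(f)]  # every value is now a str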

Bogdan Vasilescu
  • That's not so good for me - since it doesn't maintain type information, I have to loop over the data and convert it, which is very slow (at least in my implementation, using a list comprehension of list comprehensions). – Guy Adini Mar 28 '12 at 07:19
1

Avro seems to be a promising and properly designed, though not yet popular, solution.
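A hedged sketch using the fastavro package; the schema and field names are assumptions about the data (`rows` stands in for the question's list of tuples). Avro's union types cover the NoneType values and its logical types cover datetimes:

import fastavro  # pip install fastavro

schema = {
    'name': 'Row',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': ['null', 'string']},
        {'name': 'count', 'type': ['null', 'long']},
        {'name': 'created',
         'type': ['null', {'type': 'long', 'logicalType': 'timestamp-millis'}]},
    ],
}

# fastavro takes an iterable of dicts matching the schema
records = ({'name': n, 'count': c, 'created': d} for n, c, d in rows)
with open('rows.avro', 'wb') as f:
    fastavro.writer(f, schema, records)

with open('rows.avro', 'rb') as f:
    rows = [(r['name'], r['count'], r['created']) for r in fastavro.reader(f)]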

SergeyR