
Is there a fast way to do serialization of a DataFrame?

I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.

How can I save a DataFrame in a binary format that can be loaded rapidly?

Andy Hayden
James Bond
  • See related question http://stackoverflow.com/questions/12772498/serialize-pandas-python-dataframe-to-binary-format – Mihai8 Jun 06 '13 at 20:44
  • Nice [blog post](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) with timings/discussions of the different I/O options – lunguini Feb 25 '19 at 20:36

4 Answers


The easiest way is just to use to_pickle (to save it as a pickle); see pickling in the docs API page:

df.to_pickle(file_name)
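
To read it back later (a minimal sketch; pd is the usual import pandas as pd alias and file_name is the same path as above):

df = pd.read_pickle(file_name)  # load the pickled DataFrame back into memory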

Another option is to use HDF5 (built on PyTables). It is slightly more work to get started but much richer for querying.
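
For instance, a minimal sketch of the HDF5 route via to_hdf/read_hdf (the file name and key are just illustrative, and the optional PyTables package, tables, must be installed):

df.to_hdf("store.h5", key="df", mode="w")   # write the frame to an HDF5 file
df = pd.read_hdf("store.h5", key="df")      # load it back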

Felix D.
Andy Hayden

DataFrame.to_msgpack is experimental and not without some issues (e.g. with Unicode), but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still way faster than unpickling.
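
A minimal sketch, assuming an older pandas release (to_msgpack was deprecated in 0.25 and removed in 1.0, so it will not run on current versions; the file name is just illustrative):

df.to_msgpack("frame.msg")          # serialize to msgpack
df = pd.read_msgpack("frame.msg")   # load it back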

Felix D.
Sergey Orshanskiy

I prepared this answer for this question, which was erroneously marked as a duplicate of this one.

The best method for speed is not the best method for portability or fidelity. Pickle is fast and faithful, but not portable or archival safe. HDF is portable and archival safe, but is slower and can only store DataFrames with certain formats and structures.

Summary:

  • For sharing and archiving of simple tables, where some changes in format are tolerable: csv, excel, or json, depending on your application.
  • For perfect save-and-restore, but no portability or archival safety: pickle
  • For archiving: hdf, but not all tables can be saved portably or losslessly in the format. You may need to restructure things and convert some types.

Details: We'd like a method that pandas already supports with both a .to_<format> method on the DataFrame class and a read_<format> function in the pandas module. In Pandas 1.5.2 these are csv, excel, feather, gbq, hdf, html, json, orc, parquet, pickle, sql, stata, xml.

  • The formats excel and csv are highly portable and nice for simple tables. Complicated tables and data structures won't survive the round trip.
  • json is also highly portable, but will change the data in the table. NaNs will be converted to None, numpy arrays may convert to nested lists, etc.
  • I'll skip feather, gbq, orc, parquet, sql, and stata. These are specific formats not wholly compatible with the DataFrame format. They are either not very portable, or not very flexible. I'll also skip html; it can't faithfully save and restore all of the details of a DataFrame.
  • pickle is the easiest to use for a faithful save/restore. However, it is not portable and not archival safe. Expect pickle files to fail to load correctly in future versions.
  • This leaves hdf. This should be an archival safe and highly portable format. Many scientific applications read or store hdf files. However, python will still need to pickle any dataframe contents that can't be converted to C types.
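
As a quick illustration of the to_<format>/read_<format> round trips discussed above (a sketch only; file names are illustrative and the hdf case needs the optional tables package):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

# pickle: a faithful round trip, the same contents come back
df.to_pickle("frame.pkl")
assert df.equals(pd.read_pickle("frame.pkl"))

# hdf: also round-trips this simple table (requires PyTables)
df.to_hdf("frame.h5", key="df", mode="w")
assert df.equals(pd.read_hdf("frame.h5", key="df"))

# json: portable, but values and dtypes may change on the way back
df.to_json("frame.json")
df_json = pd.read_json("frame.json")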
MRule

Have you timed the available I/O functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
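
For example, a rough timing sketch (file names and the test frame are purely illustrative; the hdf case needs the optional tables package):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 10))

writers = {
    "pickle": lambda: df.to_pickle("frame.pkl"),
    "hdf":    lambda: df.to_hdf("frame.h5", key="df", mode="w"),
    "csv":    lambda: df.to_csv("frame.csv"),
}

for name, write in writers.items():
    start = time.perf_counter()
    write()
    print(f"{name}: {time.perf_counter() - start:.2f} s")  # compare write times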

Achim