
Is there a fast way to do serialization of a DataFrame?

I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.

How can I save a DataFrame in a binary format that can be loaded rapidly?

Andy Hayden
James Bond
  • See related question http://stackoverflow.com/questions/12772498/serialize-pandas-python-dataframe-to-binary-format – Mihai8 Jun 06 '13 at 20:44
  • Nice [blog post](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) with timings/discussions of the different I/O options – lunguini Feb 25 '19 at 20:36

4 Answers


The easiest way is just to use to_pickle (to save it as a pickle); see pickling in the docs API page:

df.to_pickle(file_name)
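
To read it back later (a minimal sketch; pd is the usual import pandas as pd alias and file_name is the same path as above):

df = pd.read_pickle(file_name)  # load the pickled DataFrame back into memory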

Another option is to use HDF5 (built on PyTables). It is slightly more work to get started but much richer for querying.
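
For instance, a minimal sketch of the HDF5 route via to_hdf/read_hdf (the file name and key are just illustrative, and the optional PyTables package, tables, must be installed):

df.to_hdf("store.h5", key="df", mode="w")   # write the frame to an HDF5 file
df = pd.read_hdf("store.h5", key="df")      # load it back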

Felix D.
Andy Hayden

DataFrame.to_msgpack is experimental and not without some issues (e.g. with Unicode), but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still way faster than unpickling.
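
A minimal sketch, assuming an older pandas release (to_msgpack was deprecated in 0.25 and removed in 1.0, so it will not run on current versions; the file name is just illustrative):

df.to_msgpack("frame.msg")          # serialize to msgpack
df = pd.read_msgpack("frame.msg")   # load it back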

Felix D.
Sergey Orshanskiy

I prepared this answer for this question, which was erroneously marked as a duplicate of this one.

The best method for speed is not the best method for portability or fidelity. Pickle is fast and faithful, but not portable or archival safe. HDF is portable and archival safe, but is slower and can only store DataFrames with certain formats and structures.

Summary:

  • For sharing and archiving of simple tables, where some changes in format are tolerable: csv, excel, or json, depending on your application.
  • For perfect save-and-restore, but no portability or archival safety: pickle
  • For archiving: hdf, but not all tables can be saved portably or losslessly in the format. You may need to restructure things and convert some types.

Details: We'd like a method that pandas already supports with both a .to_<format> method on the DataFrame class and a read_<format> function in the pandas module. In Pandas 1.5.2 these are csv, excel, feather, gbq, hdf, html, json, orc, parquet, pickle, sql, stata, xml.

  • The formats excel and csv are highly portable and nice for simple tables. Complicated tables and data structures won't survive the round trip.
  • json is also highly portable, but will change the data in the table. NaNs will be converted to None, numpy arrays may convert to nested lists, etc.
  • I'll skip feather, gbq, orc, parquet, sql, and stata. These are specific formats not wholly compatible with the DataFrame format. They are either not very portable, or not very flexible. I'll also skip html; it can't faithfully save and restore all of the details of a DataFrame.
  • pickle is the easiest to use for a faithful save/restore. However, it is not portable and not archival safe. Expect pickle files to fail to load correctly in future versions.
  • This leaves hdf. This should be an archival safe and highly portable format. Many scientific applications read or store hdf files. However, python will still need to pickle any dataframe contents that can't be converted to C types.
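
As a quick illustration of the to_<format>/read_<format> round trips discussed above (a sketch only; file names are illustrative and the hdf case needs the optional tables package):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

# pickle: a faithful round trip, the same contents come back
df.to_pickle("frame.pkl")
assert df.equals(pd.read_pickle("frame.pkl"))

# hdf: also round-trips this simple table (requires PyTables)
df.to_hdf("frame.h5", key="df", mode="w")
assert df.equals(pd.read_hdf("frame.h5", key="df"))

# json: portable, but values and dtypes may change on the way back
df.to_json("frame.json")
df_json = pd.read_json("frame.json")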
MRule

Have you timed the available I/O functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
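
For example, a rough timing sketch (file names and the test frame are purely illustrative; the hdf case needs the optional tables package):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 10))

writers = {
    "pickle": lambda: df.to_pickle("frame.pkl"),
    "hdf":    lambda: df.to_hdf("frame.h5", key="df", mode="w"),
    "csv":    lambda: df.to_csv("frame.csv"),
}

for name, write in writers.items():
    start = time.perf_counter()
    write()
    print(f"{name}: {time.perf_counter() - start:.2f} s")  # compare write times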

Achim