
I am trying to find the best way to efficiently read and write large data frames (250 MB+) to and from disk using Python/Pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.

This is part of a larger project exploring migrating our current analytic/data management environment from Stata to Python. When I compare the read/write times in my tests to those that I get with Stata, Python and Pandas are typically taking more than 20 times as long.

I strongly suspect that I am the problem, not Python or Pandas.

Any suggestions?

user2928791

1 Answer


Using HDFStore is your best bet (it is not covered much in the book, and it has changed quite a lot). You will find its performance is MUCH better than any other serialization method.
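For reference, a minimal sketch of round-tripping a frame through HDFStore; the file name, key, and example frame are made up for illustration, and this requires the PyTables package to be installed:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for a 250 MB+ dataset.
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=[f"col{i}" for i in range(10)])

# Write the frame to an HDF5 file; 'df' is the key within the store.
with pd.HDFStore("data.h5", mode="w") as store:
    store.put("df", df)

# Read it back.
with pd.HDFStore("data.h5", mode="r") as store:
    df2 = store["df"]
```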

Jeff
    Indeed, HDF5 proved to work well, particularly if the right set of options was used. Using blosc compression, chunksize=4, and complevel=3 proved the fastest. – user2928791 Oct 28 '13 at 20:04
  • a lot depends on how you are storing (e.g. appending all at once is usually best), whether you need to append, and compression. My 2c: that chunksize is pretty small; the default is 50k rows. – Jeff Oct 28 '13 at 20:19
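A minimal sketch of the options discussed in these comments, assuming the same made-up frame and file name style as above: blosc compression with complevel=3 set on the store, writing as much as possible in a single append, and an explicit chunksize:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real dataset.
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=[f"col{i}" for i in range(10)])

# Open a store with blosc compression at complevel=3.
with pd.HDFStore("data_compressed.h5", mode="w",
                 complib="blosc", complevel=3) as store:
    # 'table' format supports appending; writing everything at once is usually best.
    store.append("df", df, format="table")

# Later appends go to the same key; chunksize controls how many rows are
# written per batch (the default is 50,000 rows).
more = pd.DataFrame(np.random.randn(100_000, 10),
                    columns=[f"col{i}" for i in range(10)])
with pd.HDFStore("data_compressed.h5", mode="a",
                 complib="blosc", complevel=3) as store:
    store.append("df", more, format="table", chunksize=50_000)
```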