
I am trying to find the best way to efficiently read and write large data frames (250 MB+) to and from disk using Python/Pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.

This is part of a larger project exploring migrating our current analytic/data management environment from Stata to Python. When I compare the read/write times in my tests to those that I get with Stata, Python and Pandas are typically taking more than 20 times as long.

I strongly suspect that I am the problem, not Python or Pandas.

Any suggestions?

user2928791

1 Answer


Using HDFStore is your best bet (it is not covered much in the book, and it has changed quite a lot). You will find its performance is MUCH better than any other serialization method.
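For reference, a minimal sketch of round-tripping a frame through HDFStore; the file name, key, and example frame are made up for illustration, and this requires the PyTables package to be installed:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for a 250 MB+ dataset.
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=[f"col{i}" for i in range(10)])

# Write the frame to an HDF5 file; 'df' is the key within the store.
with pd.HDFStore("data.h5", mode="w") as store:
    store.put("df", df)

# Read it back.
with pd.HDFStore("data.h5", mode="r") as store:
    df2 = store["df"]
```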

Jeff
    Indeed, HDF5 proved to work well, particularly if the right set of options was used. Using blosc compression, chunksize=4, and complevel=3 proved the fastest. – user2928791 Oct 28 '13 at 20:04
  • a lot depends on how you are storing (e.g. appending all at once is usually best), whether you need to append, and compression. My 2c: that chunksize is pretty small; the default is 50k rows. – Jeff Oct 28 '13 at 20:19
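A minimal sketch of the options discussed in these comments, assuming the same made-up frame and file name style as above: blosc compression with complevel=3 set on the store, writing as much as possible in a single append, and an explicit chunksize:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real dataset.
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=[f"col{i}" for i in range(10)])

# Open a store with blosc compression at complevel=3.
with pd.HDFStore("data_compressed.h5", mode="w",
                 complib="blosc", complevel=3) as store:
    # 'table' format supports appending; writing everything at once is usually best.
    store.append("df", df, format="table")

# Later appends go to the same key; chunksize controls how many rows are
# written per batch (the default is 50,000 rows).
more = pd.DataFrame(np.random.randn(100_000, 10),
                    columns=[f"col{i}" for i in range(10)])
with pd.HDFStore("data_compressed.h5", mode="a",
                 complib="blosc", complevel=3) as store:
    store.append("df", more, format="table", chunksize=50_000)
```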