
There is a lot of documentation on the most efficient way to store pandas dataframes (e.g. How to store a dataframe using Pandas), but most of the resources focus on I/O time efficiency.

I would like to save large pandas dataframes, which typically take up several GB of disk space in CSV format, to a more lightweight format without losing any information.

The LightGBM Dataset looks promising, but I did not manage to correctly reload my data.

Any suggestions?

Tanguy
  • I usually use `joblib`, which saves in binary. I hear `df.to_feather` is also efficient, but I have never tried it. – Quang Hoang May 24 '19 at 12:47
  • I use pandas to_hdf with blosc compression. Look at the comparison here: https://dziganto.github.io/out-of-core%20computation/HDF5-Or-How-I-Learned-To-Love-Data-Compression-And-Partial-Input-Output/ – AT_asks May 24 '19 at 13:02
  • 1
    https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d "As our little test shows, it seems that feather format is an ideal candidate to store the data between Jupyter sessions. It shows high I/O speed, doesn’t take too much memory on the disk and doesn’t need any unpacking when loaded back into RAM." – wkzhu May 24 '19 at 20:38
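For reference, a minimal sketch of the feather and HDF5-with-blosc options mentioned in the comments above. The file names and the toy dataframe are hypothetical; `to_feather` requires pyarrow, and `to_hdf` requires PyTables.

```python
import pandas as pd

# Hypothetical example dataframe; replace with your own data.
df = pd.DataFrame({"a": range(1_000_000), "b": ["x", "y"] * 500_000})

# Feather: fast binary format, no unpacking needed when loading back into RAM.
df.to_feather("data.feather")
df_feather = pd.read_feather("data.feather")

# HDF5 with blosc compression.
df.to_hdf("data.h5", key="df", complib="blosc", complevel=9)
df_hdf = pd.read_hdf("data.h5", "df")
```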

1 Answer


If you are optimizing for file size, Apache Parquet may be your best friend. As the article linked by @wkzhu suggests, it achieves the best compression of the common formats, particularly if you have categorical data.
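A minimal sketch of the round trip, assuming pyarrow (or fastparquet) is installed; the file name and toy dataframe are hypothetical. Casting repeated strings to `category` lets Parquet dictionary-encode them, which usually shrinks the file further.

```python
import pandas as pd

# Hypothetical dataframe with a low-cardinality string column.
df = pd.DataFrame({"id": range(1_000_000),
                   "label": ["red", "green", "blue", "blue"] * 250_000})
df["label"] = df["label"].astype("category")

# Write with compression (snappy is the pandas default; gzip/brotli trade speed for size).
df.to_parquet("data.parquet", compression="snappy")

# Read back; dtypes, including categoricals, are preserved.
df2 = pd.read_parquet("data.parquet")
```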

Timbus Calin
lozbeardy