
There is a lot of documentation on the most efficient way to store pandas dataframes (e.g. How to store a dataframe using Pandas), but most of the resources focus on I/O time efficiency.

I would like to save large pandas dataframes, which typically take up several GB of disk space in CSV format, to a more lightweight format without losing any information.

The LightGBM Dataset looks promising, but I did not manage to correctly reload my data.

Any suggestions?

Tanguy
  • I usually use `joblib`, which saves in binary. I hear `df.to_feather` is also efficient, but I have never tried it. – Quang Hoang May 24 '19 at 12:47
  • I use pandas to_hdf with blosc compression. Look at the comparison here: https://dziganto.github.io/out-of-core%20computation/HDF5-Or-How-I-Learned-To-Love-Data-Compression-And-Partial-Input-Output/ – AT_asks May 24 '19 at 13:02
  • 1
    https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d "As our little test shows, it seems that feather format is an ideal candidate to store the data between Jupyter sessions. It shows high I/O speed, doesn’t take too much memory on the disk and doesn’t need any unpacking when loaded back into RAM." – wkzhu May 24 '19 at 20:38
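For reference, a minimal sketch of the feather and HDF5-with-blosc options mentioned in the comments above. The file names and the toy dataframe are hypothetical; `to_feather` requires pyarrow, and `to_hdf` requires PyTables.

```python
import pandas as pd

# Hypothetical example dataframe; replace with your own data.
df = pd.DataFrame({"a": range(1_000_000), "b": ["x", "y"] * 500_000})

# Feather: fast binary format, no unpacking needed when loading back into RAM.
df.to_feather("data.feather")
df_feather = pd.read_feather("data.feather")

# HDF5 with blosc compression.
df.to_hdf("data.h5", key="df", complib="blosc", complevel=9)
df_hdf = pd.read_hdf("data.h5", "df")
```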

1 Answer


If you are optimizing for file size, Apache Parquet may be your best friend. As the article linked by @wkzhu suggests, it achieves the best compression of the common formats, particularly if you have categorical data.
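A minimal sketch of the round trip, assuming pyarrow (or fastparquet) is installed; the file name and toy dataframe are hypothetical. Casting repeated strings to `category` lets Parquet dictionary-encode them, which usually shrinks the file further.

```python
import pandas as pd

# Hypothetical dataframe with a low-cardinality string column.
df = pd.DataFrame({"id": range(1_000_000),
                   "label": ["red", "green", "blue", "blue"] * 250_000})
df["label"] = df["label"].astype("category")

# Write with compression (snappy is the pandas default; gzip/brotli trade speed for size).
df.to_parquet("data.parquet", compression="snappy")

# Read back; dtypes, including categoricals, are preserved.
df2 = pd.read_parquet("data.parquet")
```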

Timbus Calin
lozbeardy