I prepared this answer for this question, which was erroneously marked as a duplicate of this one.
The best method for speed is not the best method for portability or fidelity: pickle is fast and faithful, but neither portable nor archival-safe; HDF is portable and archival-safe, but is slower and can only store DataFrames with certain dtypes and structures.
Summary:
- For sharing and archiving of simple tables, where some changes in format are tolerable: `csv`, `excel`, or `json`, depending on your application.
- For a perfect save-and-restore, but no portability or archival safety: `pickle`.
- For archiving: `hdf`, but not all tables can be saved portably or losslessly in that format. You may need to restructure things and convert some types.
Details: We'd like a method that `pandas` already supports, with both a `to_<format>` method on the `DataFrame` class and a `read_<format>` function in the `pandas` module. In pandas 1.5.2 these are `csv`, `excel`, `feather`, `gbq`, `hdf`, `html`, `json`, `orc`, `parquet`, `pickle`, `sql`, `stata`, and `xml`.
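As a sanity check, here is a minimal sketch (my illustration, not anything from the pandas docs) that lists these pairs from the installed pandas itself, assuming only that the writers are `to_*` methods on `DataFrame` and the readers are `read_*` functions in the `pandas` namespace:

```python
import pandas as pd

# Formats with a DataFrame.to_<format> writer.
writers = {name[3:] for name in dir(pd.DataFrame) if name.startswith("to_")}
# Formats with a pandas.read_<format> reader.
readers = {name[5:] for name in dir(pd) if name.startswith("read_")}

# The intersection is the set of round-trippable formats. Note this also
# picks up 'clipboard', which is an I/O channel rather than a file format.
print(sorted(writers & readers))
```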
- The formats `excel` and `csv` are highly portable and nice for simple tables. Complicated tables and data structures won't survive the round trip. `json` is also highly portable, but will change the data in the table: `NaN`s will be converted to `None`, numpy arrays may be converted to nested lists, etc. (see the first sketch after this list).
- I'll skip `feather`, `gbq`, `orc`, `parquet`, `sql`, and `stata`. These are specific formats not wholly compatible with the `DataFrame` structure; they are either not very portable or not very flexible. I'll also skip `html`; it can't faithfully save and restore all of the details of a `DataFrame`.
- `pickle` is the easiest to use for a faithful save/restore (second sketch below). However, it is not portable and not archival-safe: expect pickle files to fail to load correctly in future versions.
- This leaves `hdf`. This should be an archival-safe and highly portable format; many scientific applications read or store `hdf` files. However, Python will still need to pickle any DataFrame contents that can't be converted to native C types, which undoes the portability for those columns (third sketch below).
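To make the `csv`/`json` caveats concrete, here is a minimal sketch; the column names and values are made up for illustration:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.5, np.nan],
    "when": pd.to_datetime(["2021-01-01", "2021-06-15"]),
})

# CSV round trip: the datetime column comes back as plain strings
# and must be re-parsed by hand.
csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(csv_back["when"].dtype)  # object, no longer datetime64[ns]

# JSON round trip: NaN is written as null, and datetimes become epoch
# milliseconds that read_json does not recognize for a column named 'when'.
text = df.to_json()
print(text)  # note the "null" where the NaN was
json_back = pd.read_json(text)
print(json_back["when"].dtype)  # int64 epoch timestamps, not datetimes
```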
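The pickle round trip, by contrast, is faithful. A sketch (the file name is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.5, np.nan],
    "cat": pd.Categorical(["a", "b"]),
    "when": pd.to_datetime(["2021-01-01", "2021-06-15"]).tz_localize("UTC"),
})

# Everything comes back intact: values, dtypes, index, and all.
df.to_pickle("frame.pkl")
restored = pd.read_pickle("frame.pkl")
print(restored.equals(df))                    # True
print((restored.dtypes == df.dtypes).all())   # True
```

The flip side is that this file can only be read back by a compatible pandas/Python, which is exactly the archival problem described above.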
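And the `hdf` caveat. This sketch needs the optional PyTables dependency (`pip install tables`); the dict column stands in for any contents without a native HDF5 representation:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "blob": [{"a": 1}, {"b": 2}]})

# The numeric column maps cleanly onto an HDF5 type. The object column
# does not, so pandas falls back to pickling it inside the file and emits
# a PerformanceWarning; that column is then opaque to non-Python HDF5 readers.
df.to_hdf("frame.h5", key="df")
back = pd.read_hdf("frame.h5", "df")
print(back)
```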