UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.
Pros and cons:

- Parquet
  - pros
    - one of the fastest and most widely supported binary storage formats
    - supports very fast compression methods (for example, the Snappy codec)
    - the de facto standard storage format for data lakes / big data
  - cons
    - the whole dataset must be read into memory; you can't read a smaller subset of rows. One way to work around this is to partition the data and read only the required partitions (see the Parquet sketch after this list)
    - no support for indexing - you can't read a specific row or a range of rows; you always have to read the whole Parquet file
    - Parquet files are immutable - they can't be changed (no appending, updating or deleting); you can only write or overwrite a Parquet file as a whole. This "limitation" comes from the big-data world, where it would be considered one of the huge "pros"
- HDF5
  - pros
    - supports data slicing - the ability to read a portion of the whole dataset, so you can work with datasets that wouldn't fit completely into RAM (see the HDF5 sketch after this list)
    - relatively fast binary storage format
    - supports compression (though the compression is slower than the Snappy codec used by Parquet)
    - supports appending rows (mutable)
  - cons
- Pickle
  - pros
  - cons
    - requires a lot of disk space
    - for long-term storage you might run into compatibility problems; you may need to specify the pickle protocol version when reading old pickle files (see the Pickle sketch after this list)
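
To make the Parquet partitioning workaround concrete, here is a minimal sketch using pandas with the pyarrow engine; the file names and the `year` partition column are made up for illustration:

```python
import pandas as pd

# hypothetical sample data; `year` is just an illustrative partition column
df = pd.DataFrame({"year": [2022, 2022, 2023], "value": [1.0, 2.0, 3.0]})

# plain write with Snappy compression (pyarrow's default codec)
df.to_parquet("data.parquet", compression="snappy")

# partitioned write: one subdirectory per `year` value
df.to_parquet("data_by_year", partition_cols=["year"])

# read back only the partitions you need instead of the whole dataset
subset = pd.read_parquet("data_by_year", filters=[("year", "=", 2023)])
```

The `filters` argument prunes at the partition-directory level, so the 2022 files are never touched.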
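
And a minimal sketch of HDF5 slicing and appending (requires PyTables; the file name and key are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})

# format="table" is what enables slicing, querying and appending;
# complib/complevel turn on zlib compression
df.to_hdf("data.h5", key="df", format="table", complib="zlib", complevel=5)

# read only rows 100..199 - the rest of the file is never loaded into RAM
chunk = pd.read_hdf("data.h5", key="df", start=100, stop=200)

# append more rows to the same table (HDF5 is mutable in this sense)
df.to_hdf("data.h5", key="df", format="table", append=True)
```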
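
Finally, a small Pickle sketch showing how to pin the protocol explicitly, which helps with the compatibility issue mentioned above (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# pin the protocol so you know exactly what readers will need;
# protocol 4 is supported by every Python >= 3.4
df.to_pickle("data.pkl", protocol=4)

df2 = pd.read_pickle("data.pkl")
```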
OLD Answer:
I would consider only two storage formats: HDF5 (PyTables) and Feather
Here are the results of my read/write comparison for a DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).
Comparison for the following storage formats: CSV, CSV.gzip, Pickle, HDF5 (various compression settings):
| storage          | read (s) | write (s) | size ratio to CSV |
|------------------|---------:|----------:|------------------:|
| CSV              |   17.900 |     69.00 |             1.000 |
| CSV.gzip         |   18.900 |    186.00 |             0.047 |
| Pickle           |    0.173 |      1.77 |             0.374 |
| HDF_fixed        |    0.196 |      2.03 |             0.435 |
| HDF_tab          |    0.230 |      2.60 |             0.437 |
| HDF_tab_zlib_c5  |    0.845 |      5.44 |             0.035 |
| HDF_tab_zlib_c9  |    0.860 |      5.95 |             0.035 |
| HDF_tab_bzip2_c5 |    2.500 |     36.50 |             0.011 |
| HDF_tab_bzip2_c9 |    2.500 |     36.50 |             0.011 |
But it might be different for you, because all my data was of the datetime dtype, so it's always better to run such a comparison with your real data, or at least with similar data...
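
If you want to run this kind of comparison on your own data, here is a rough sketch of the timing methodology - not the exact script used for the table above, and the random float data is only illustrative (the original test used datetime columns):

```python
import os
import time

import numpy as np
import pandas as pd

# illustrative DataFrame of roughly the same shape as in the test above
df = pd.DataFrame(np.random.randn(4_000_000, 6), columns=list("abcdef"))

def timed(fn):
    """Run fn once and return the elapsed wall-clock time in seconds."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# (name, file path, writer, reader) for each storage format under test
cases = {
    "CSV":       ("bench.csv",
                  lambda p: df.to_csv(p, index=False),
                  lambda p: pd.read_csv(p)),
    "Pickle":    ("bench.pkl",
                  lambda p: df.to_pickle(p),
                  lambda p: pd.read_pickle(p)),
    "HDF_fixed": ("bench_fixed.h5",
                  lambda p: df.to_hdf(p, key="df", format="fixed"),
                  lambda p: pd.read_hdf(p, key="df")),
    "HDF_tab_zlib_c5": ("bench_tab.h5",
                  lambda p: df.to_hdf(p, key="df", format="table",
                                      complib="zlib", complevel=5),
                  lambda p: pd.read_hdf(p, key="df")),
}

csv_size = None
for name, (path, write, read) in cases.items():
    w = timed(lambda: write(path))
    r = timed(lambda: read(path))
    size = os.path.getsize(path)
    csv_size = csv_size or size  # first case is CSV, used as the size baseline
    print(f"{name:16s} read: {r:7.3f}s  write: {w:7.3f}s  "
          f"size ratio to CSV: {size / csv_size:.3f}")
```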