I've been working for a while with very large DataFrames and I've been using the csv format to store input data and results. I've noticed that a lot of time goes into reading and writing these files which, for example, dramatically slows down batch processing of data. I was wondering if the file format itself is of relevance. Is there a preferred file format for faster reading/writing Pandas DataFrames and/or Numpy arrays?
5 Answers
Use HDF5. Beats writing flat files hands down. And you can query. Docs are here
Here's a perf comparison vs SQL. Updated to show SQL/HDF_fixed/HDF_table/CSV write and read perfs.
The docs now include a performance section; see here.
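A minimal sketch of the table-format workflow that supports querying (assumes PyTables is installed; the frame, file name, and column name are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(100000), 'b': np.random.rand(100000)})

# format='table' is slower to write than the default fixed format,
# but it allows on-disk queries through the `where` argument.
df.to_hdf('store.h5', key='df', format='table', data_columns=['a'])

# Load only the rows matching the condition instead of the whole file.
subset = pd.read_hdf('store.h5', 'df', where='a > 0.5')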
- I don't know if the performance section has been edited since 2014, but the results there now show that feather and pickle are faster than HDF. But HDF has wider adoption, I think? – James Paul Mason Dec 06 '17 at 18:56
- The link has moved to [https://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations](https://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations) – krismath Mar 07 '18 at 10:37
Recently pandas added support for the parquet format, using the pyarrow library as its backend (written by Wes McKinney himself, with his usual obsession for performance).
You only need to install the pyarrow library and use the methods read_parquet and to_parquet. Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or so), and it also keeps track of dtype metadata, so you won't lose data type information when writing to and reading from disk. It can actually store some data types more efficiently than HDF5, which is not very performant with strings and timestamps: HDF5 doesn't have a native data type for those, so it uses pickle to serialize them, which makes it slow for big datasets.
Parquet is also a columnar format, which makes it very easy to do two things (see the short sketch after this list):
- Quickly filter out columns you're not interested in. With CSV you have to read the whole file and only then can you throw away the columns you don't want. With parquet you can read only the columns you're interested in.
- Run queries that filter out rows and read only what you care about.
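A minimal sketch of the round trip (the frame, file name, and column names are made up; requires the pyarrow library):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ints': np.arange(1000),
                   'floats': np.random.rand(1000),
                   'strings': ['x'] * 1000})

# Writes with snappy compression by default and stores the dtypes in the file metadata.
df.to_parquet('data.parquet')

# Read back only the columns of interest; the others are never deserialized.
subset = pd.read_parquet('data.parquet', columns=['ints', 'floats'])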
Another interesting recent development is the Feather file format, also developed by Wes McKinney. It's essentially just an uncompressed arrow format written directly to disk, so it is potentially faster to write than the Parquet format. The disadvantage is files that are 2-3x larger.
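A minimal round-trip sketch, also assuming pyarrow is installed (the frame and file name are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1000), 'b': np.arange(1000)})

# Feather writes the Arrow memory layout straight to disk.
df.to_feather('data.feather')
df2 = pd.read_feather('data.feather')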

- Totally agree. And the "snappy" compression available in pyarrow is just great! – Abramodj Apr 08 '19 at 13:04
It's always a good idea to run some benchmarks for your use case. I've had good results storing raw structs via numpy:
import numpy as np
import pandas as pd

# mytype is a numpy structured dtype matching df.to_records() (with fixed-width
# strings in place of object fields); you must keep it around to read the data back.
df.to_records().astype(mytype).tofile('mydata')
df = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype))
It is quite fast and takes up less space on the disk. But: you'll need to keep track of the dtype to reload the data, it's not portable between architectures, and it doesn't support the advanced features of HDF5. (numpy has a more advanced binary format which is designed to overcome the first two limitations, but I haven't had much success getting it to work.)
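For reference, here's a minimal sketch using numpy's .npy format (presumably the "more advanced binary format" referred to above); it writes the dtype and shape into the file header, so there's no separate dtype bookkeeping and the byte order is recorded:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1000), 'b': np.random.rand(1000)})

# The structured dtype of to_records() is stored in the .npy header automatically.
np.save('mydata.npy', df.to_records(index=False))
df2 = pd.DataFrame.from_records(np.load('mydata.npy'))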
Update: Thanks for pressing me for numbers. My benchmark indicates that indeed HDF5 wins, at least in my case. It's both faster and smaller on disk! Here's what I see with a dataframe of about 280k rows, 7 float columns, and a string index:
In [15]: %timeit df.to_hdf('test_fixed.hdf', 'test', mode='w')
10 loops, best of 3: 172 ms per loop
In [17]: %timeit df.to_records().astype(mytype).tofile('raw_data')
1 loops, best of 3: 283 ms per loop
In [20]: %timeit pd.read_hdf('test_fixed.hdf', 'test')
10 loops, best of 3: 36.9 ms per loop
In [22]: %timeit pd.DataFrame.from_records(np.fromfile('raw_data', dtype=mytype))
10 loops, best of 3: 40.7 ms per loop
In [23]: ls -l raw_data test_fixed.hdf
-rw-r----- 1 altaurog altaurog 18167232 Apr 8 12:42 raw_data
-rw-r----- 1 altaurog altaurog 15537704 Apr 8 12:41 test_fixed.hdf

- Out of curiosity, have you compared saving in ``.npz`` perf-wise with using HDF5? – Jeff Apr 08 '14 at 16:13
- No, I haven't, and I'd be curious to see one. We use postgresql in production. I've been using the raw format only during development for passing data around. – Aryeh Leib Taurog Apr 08 '14 at 16:37
- You can't get much simpler than `to_hdf()` and `read_hdf()`. The only complexity is the additional dependencies. – Aryeh Leib Taurog Apr 08 '14 at 17:09
HDF is indeed a very good choice; you can also use npy/npz, with some caveats.
Here is a benchmark using a data frame of 25k rows and 1000 columns filled with random floats:
Saving to HDF took 0.49s
Saving to npy took 0.40s
Loading from HDF took 0.10s
Loading from npy took 0.061s
npy is about 20% faster to write and about 40% faster to read if you don't compress data.
Code used to generate the output above:
#!/usr/bin/python3
import pandas as pd
import random
import numpy as np
import time
start = time.time()
f = pd.DataFrame()
for i in range(1000):
    f['col_{}'.format(i)] = np.random.rand(25000)
print('Generating data took {}s'.format(time.time() - start))
start = time.time()
f.to_hdf('frame.hdf', 'main', format='fixed')
print('Saving to HDF took {}s'.format(time.time() - start))
start = time.time()
np.savez('frame.npz', f.index, f.values)
print('Saving to npy took {}s'.format(time.time() - start))
start = time.time()
pd.read_hdf('frame.hdf')
print('Loading from HDF took {}s'.format(time.time() - start))
start = time.time()
npz = np.load('frame.npz')  # np.savez stores unnamed arrays as arr_0, arr_1, ...
index, values = npz['arr_0'], npz['arr_1']
pd.DataFrame(values, index=index)
print('Loading from npy took {}s'.format(time.time() - start))

- It isn't entirely clear what you are doing with `numpy` here. Can you elaborate on which library and method you used for `numpy` and which for `pandas`? – Alexander McFarlane Oct 30 '16 at 15:51
- I added the code I used in my response. I hope it addresses your concern. – rahenri Nov 16 '16 at 01:49
If the priority is speed, I would recommend (a quick comparison sketch follows below):
- feather - the fastest
- parquet - a bit slower, but saves lots of disk space
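A rough sketch for comparing the two on your own data (assumes pyarrow is installed; the frame size and file names are arbitrary):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(25000, 100),
                  columns=['col_{}'.format(i) for i in range(100)])

start = time.time()
df.to_feather('frame.feather')
print('feather write took {:.2f}s'.format(time.time() - start))

start = time.time()
df.to_parquet('frame.parquet')   # snappy-compressed by default
print('parquet write took {:.2f}s'.format(time.time() - start))

start = time.time()
pd.read_feather('frame.feather')
print('feather read took {:.2f}s'.format(time.time() - start))

start = time.time()
pd.read_parquet('frame.parquet')
print('parquet read took {:.2f}s'.format(time.time() - start))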
