I've been working for a while with very large DataFrames and I've been using the csv format to store input data and results. I've noticed that a lot of time goes into reading and writing these files which, for example, dramatically slows down batch processing of data. I was wondering if the file format itself is of relevance. Is there a preferred file format for faster reading/writing Pandas DataFrames and/or Numpy arrays?
5 Answers
Use HDF5. Beats writing flat files hands down. And you can query. Docs are here
Here's a perf comparison vs SQL. Updated to show SQL/HDF_fixed/HDF_table/CSV write and read perfs.
The docs now include a performance section; see here.
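A minimal sketch of the table-format workflow that supports querying (assumes PyTables is installed; the frame, file name, and column name are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(100000), 'b': np.random.rand(100000)})

# format='table' is slower to write than the default fixed format,
# but it allows on-disk queries through the `where` argument.
df.to_hdf('store.h5', key='df', format='table', data_columns=['a'])

# Load only the rows matching the condition instead of the whole file.
subset = pd.read_hdf('store.h5', 'df', where='a > 0.5')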
- I don't know if the performance section has been edited since 2014, but the results there now show that feather and pickle are faster than HDF. But HDF has wider adoption, I think? – James Paul Mason Dec 06 '17 at 18:56
- The link has moved to [https://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations](https://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations) – krismath Mar 07 '18 at 10:37
Recently pandas added support for the parquet format, using the pyarrow library as its backend (written by Wes McKinney himself, with his usual obsession for performance).
You only need to install the pyarrow library and use the methods read_parquet and to_parquet. Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or so), and it also keeps track of dtype metadata, so you won't lose data type information when writing to and reading from disk. It can actually store some data types more efficiently than HDF5, which is not very performant with strings and timestamps: HDF5 doesn't have a native data type for those, so it uses pickle to serialize them, which makes it slow for big datasets.
Parquet is also a columnar format, which makes it very easy to do two things (see the short sketch after this list):
- Quickly filter out columns you're not interested in. With CSV you have to read the whole file and only then can you throw away the columns you don't want. With parquet you can read only the columns you're interested in.
- Run queries that filter out rows and read only what you care about.
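A minimal sketch of the round trip (the frame, file name, and column names are made up; requires the pyarrow library):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ints': np.arange(1000),
                   'floats': np.random.rand(1000),
                   'strings': ['x'] * 1000})

# Writes with snappy compression by default and stores the dtypes in the file metadata.
df.to_parquet('data.parquet')

# Read back only the columns of interest; the others are never deserialized.
subset = pd.read_parquet('data.parquet', columns=['ints', 'floats'])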
Another interesting recent development is the Feather file format, also developed by Wes McKinney. It's essentially just an uncompressed arrow format written directly to disk, so it is potentially faster to write than the Parquet format. The disadvantage is files that are 2-3x larger.
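A minimal round-trip sketch, also assuming pyarrow is installed (the frame and file name are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1000), 'b': np.arange(1000)})

# Feather writes the Arrow memory layout straight to disk.
df.to_feather('data.feather')
df2 = pd.read_feather('data.feather')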

- Totally agree. And the "snappy" compression available in pyarrow is just great! – Abramodj Apr 08 '19 at 13:04
It's always a good idea to run some benchmarks for your use case. I've had good results storing raw structs via numpy:
import numpy as np
import pandas as pd

# mytype is a numpy structured dtype matching df.to_records() (with fixed-width
# strings in place of object fields); you must keep it around to read the data back.
df.to_records().astype(mytype).tofile('mydata')
df = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype))
It is quite fast and takes up less space on the disk. But: you'll need to keep track of the dtype to reload the data, it's not portable between architectures, and it doesn't support the advanced features of HDF5. (numpy has a more advanced binary format which is designed to overcome the first two limitations, but I haven't had much success getting it to work.)
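For reference, here's a minimal sketch using numpy's .npy format (presumably the "more advanced binary format" referred to above); it writes the dtype and shape into the file header, so there's no separate dtype bookkeeping and the byte order is recorded:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1000), 'b': np.random.rand(1000)})

# The structured dtype of to_records() is stored in the .npy header automatically.
np.save('mydata.npy', df.to_records(index=False))
df2 = pd.DataFrame.from_records(np.load('mydata.npy'))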
Update: Thanks for pressing me for numbers. My benchmark indicates that indeed HDF5 wins, at least in my case. It's both faster and smaller on disk! Here's what I see with a dataframe of about 280k rows, 7 float columns, and a string index:
In [15]: %timeit df.to_hdf('test_fixed.hdf', 'test', mode='w')
10 loops, best of 3: 172 ms per loop
In [17]: %timeit df.to_records().astype(mytype).tofile('raw_data')
1 loops, best of 3: 283 ms per loop
In [20]: %timeit pd.read_hdf('test_fixed.hdf', 'test')
10 loops, best of 3: 36.9 ms per loop
In [22]: %timeit pd.DataFrame.from_records(np.fromfile('raw_data', dtype=mytype))
10 loops, best of 3: 40.7 ms per loop
In [23]: ls -l raw_data test_fixed.hdf
-rw-r----- 1 altaurog altaurog 18167232 Apr 8 12:42 raw_data
-rw-r----- 1 altaurog altaurog 15537704 Apr 8 12:41 test_fixed.hdf

- Out of curiosity, have you compared saving in ``.npz`` perf-wise with using HDF5? – Jeff Apr 08 '14 at 16:13
- No, I haven't, and I'd be curious to see one. We use postgresql in production. I've been using the raw format only during development for passing data around. – Aryeh Leib Taurog Apr 08 '14 at 16:37
- You can't get much simpler than `to_hdf()` and `read_hdf()`. The only complexity is the additional dependencies. – Aryeh Leib Taurog Apr 08 '14 at 17:09
HDF is indeed a very good choice; you can also use npy/npz, with some caveats.
Here is a benchmark using a data frame of 25k rows and 1000 columns filled with random floats:
Saving to HDF took 0.49s
Saving to npy took 0.40s
Loading from HDF took 0.10s
Loading from npy took 0.061s
npy is about 20% faster to write and about 40% faster to read if you don't compress data.
Code used to generate the output above:
#!/usr/bin/python3
import pandas as pd
import random
import numpy as np
import time
start = time.time()
f = pd.DataFrame()
for i in range(1000):
    f['col_{}'.format(i)] = np.random.rand(25000)
print('Generating data took {}s'.format(time.time() - start))
start = time.time()
f.to_hdf('frame.hdf', 'main', format='fixed')
print('Saving to HDF took {}s'.format(time.time() - start))
start = time.time()
np.savez('frame.npz', f.index, f.values)
print('Saving to npy took {}s'.format(time.time() - start))
start = time.time()
pd.read_hdf('frame.hdf')
print('Loading from HDF took {}s'.format(time.time() - start))
start = time.time()
npz = np.load('frame.npz')  # np.savez stores unnamed arrays as arr_0, arr_1, ...
index, values = npz['arr_0'], npz['arr_1']
pd.DataFrame(values, index=index)
print('Loading from npy took {}s'.format(time.time() - start))

- It isn't entirely clear what you are doing with `numpy` here. Can you elaborate on which library and method you used for `numpy` and which for `pandas`? – Alexander McFarlane Oct 30 '16 at 15:51
- I added the code I used in my response. I hope it addresses your concern. – rahenri Nov 16 '16 at 01:49
If the priority is speed, I would recommend (a quick comparison sketch follows below):
- feather - the fastest
- parquet - a bit slower, but saves lots of disk space
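A rough sketch for comparing the two on your own data (assumes pyarrow is installed; the frame size and file names are arbitrary):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(25000, 100),
                  columns=['col_{}'.format(i) for i in range(100)])

start = time.time()
df.to_feather('frame.feather')
print('feather write took {:.2f}s'.format(time.time() - start))

start = time.time()
df.to_parquet('frame.parquet')   # snappy-compressed by default
print('parquet write took {:.2f}s'.format(time.time() - start))

start = time.time()
pd.read_feather('frame.feather')
print('feather read took {:.2f}s'.format(time.time() - start))

start = time.time()
pd.read_parquet('frame.parquet')
print('parquet read took {:.2f}s'.format(time.time() - start))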
