When reading a large HDF5 file with pandas.read_hdf(), I get extremely slow read times. The file has 50 million rows, with 3 integer columns and 2 string columns. Writing it with to_hdf(), using the table format with indexing, took almost 10 minutes; while that is also slow, I am not too concerned, as read speed is more important to me.
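For reference, the write call looked roughly like this (the df variable, file name, and key are placeholders, and data_columns=True is my assumption for the indexing mentioned above):

import pandas as pd

# roughly how the file was written; names and options are placeholders
df.to_hdf(
    "data.h5",
    key="data",          # matches the /data node read via PyTables below
    format="table",      # PyTables table format
    data_columns=True,   # create on-disk column indexes
)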
I have tried saving in both fixed and table format, with and without compression, but the read time still ranges between 2 and 5 minutes. By comparison, read_csv() on the same data takes 4 minutes.
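The read path being timed is just the plain call (again, file and key names are placeholders):

import time
import pandas as pd

start = time.time()
df = pd.read_hdf("data.h5", key="data")  # 2-5 minutes regardless of format/compression
print(f"read_hdf took {time.time() - start:.1f}s")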
I have also tried reading the HDF5 file with PyTables directly. This is much faster, at 6 seconds, and is the kind of speed I would like to see:

import tables

h5file = tables.open_file("data.h5", "r")
table = h5file.root.data.table.read()  # returns a NumPy structured array
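Note that table.read() gives back a NumPy structured array rather than a DataFrame. As a rough sketch, assuming the columns were written with data_columns=True (so each field maps to one column, and the stored index field has pandas' default name "index"), it can be wrapped back up like this:

import pandas as pd

# Sketch: one structured-array field per column; with pandas' default
# block layout ('values_block_0', ...) the blocks would need manual
# reassembly instead. String columns come back as bytes.
df = pd.DataFrame.from_records(table, index="index")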
I noticed that all the speed comparisons in the documentation use only numeric data, and when I ran those benchmarks myself I saw similar performance. Is there something I can do to optimise read performance?
Edit
Here is a sample of the data:
               col_A     col_B  col_C     col_D  col_E
30649671  1159660800  10217383      0  10596000  LACKEY
26198715  1249084800   0921720      0         0  KEY CLIFTON
19251910   752112000   0827092    104    243000  WEMPLE
47636877  1464739200  06247715      0         0  FLOYD
14121495  1233446400  05133815      0    988000  OGU ALLYN CH 9
41171050  1314835200  7C140009      0     39000  DEBERRY A
45865543  1459468800   0314892     76    254000  SABRINA
13387355   970358400  04140585     19   6956000  LA PERLA
 4186815   849398400  02039719      0  19208000  NPU UNIONSPIELHAGAN1
32666568   733622400  10072006      0   1074000  BROWN
And the info on the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52046850 entries, 0 to 52046849
Data columns (total 5 columns):
col_A int64
col_B object
col_C int64
col_D int64
col_E object
dtypes: int64(3), object(2)
memory usage: 1.9+ GB