
I have a 2 GB dataframe that is written once and read many times. I would like to keep using pandas, so I was using pd.read_hdf and df.to_hdf with the fixed format, which works fine for both reading and writing.

However, the df keeps growing as more columns are added, so I would like to switch to the table format instead, which lets me select only the columns I need when reading the data. I thought this would give me a speed advantage, but my tests don't seem to bear that out.

This example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
%time df.to_hdf("temp.h5", "temp", format="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")

shows that the fixed format is slightly faster to write (5.9 s vs 6.8 s for table on my machine).

Then reading the data (after a short pause to make sure the files have been fully written):

%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("ABC"))

Yields:

Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)

I understand that the fixed format is faster at reading the data, but why is reading the specified columns slower than reading the full dataframe? What is the benefit of the table format (with or without specified columns) over the fixed format?

Is there maybe a memory advantage when the df is growing even bigger?

user6538642

1 Answer


IMO the main advantage of using format='table' together with data_columns=[list_of_indexed_columns] is the ability to read huge HDF5 files conditionally (see the where="where clause" parameter), so that you can filter the data while reading and process it in chunks to avoid a MemoryError.
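
A minimal sketch of what that looks like (the file name 'indexed.h5' and the data are made up for illustration; this assumes PyTables is installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=list('ABC'))

# data_columns=['A'] makes column A queryable on disk
df.to_hdf('indexed.h5', key='df', format='table', data_columns=['A'], mode='w')

# only rows satisfying the condition are loaded into memory
subset = pd.read_hdf('indexed.h5', 'df', where='A > 0.5')

# alternatively, iterate in chunks to keep peak memory bounded
for chunk in pd.read_hdf('indexed.h5', 'df', chunksize=10_000):
    pass  # process each chunk here
```

Neither of these is possible with format='fixed', which always reads the whole dataset.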

You can also try saving single columns or column groups (those that will usually be read together) in separate HDF files, or in the same file under different keys.
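
For example, a sketch of the same-file-different-keys approach (the file name 'groups.h5' and the keys 'abc'/'rest' are made-up names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 9), columns=list('ABCDEFGHI'))

# store column groups that are usually read together under separate keys
df[list('ABC')].to_hdf('groups.h5', key='abc', format='table', mode='w')
df[list('DEFGHI')].to_hdf('groups.h5', key='rest', format='table', mode='a')

# reading one key touches only that group's data on disk
abc = pd.read_hdf('groups.h5', 'abc')
```

This avoids the per-row column extraction that makes columns=... on a wide table slow.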

I'd also consider using the "cutting-edge" Feather format:

Tests and timing:

import numpy as np
import pandas as pd
import feather

writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):

df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')
feather.write_dataframe(df, 'c:/temp/df.feather')

reading from disk:

In [122]: %timeit pd.read_hdf(r'C:\Temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp", columns=list('BDF'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('BDF'))
1 loop, best of 3: 218 ms per loop  # WINNER !!!

PS if you encounter the following error when using feather.write_dataframe(...):

FeatherError: Invalid: no support for strided data yet 

here is a workaround:

df = df.copy()  # copy() returns a DataFrame backed by contiguous (non-strided) arrays

after that feather.write_dataframe(df, path) should work properly...

MaxU - stand with Ukraine