
I have a 2 GB dataframe that is written once and read many times. I would like to keep using pandas, so I was using pd.read_hdf and df.to_hdf with the fixed format, which works fine for both reading and writing.

However, the df keeps growing as more columns are added, so I would like to switch to the table format instead, which lets me select only the columns I need when reading the data. I thought this would give me a speed advantage, but my tests don't seem to bear that out.

This example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
%time df.to_hdf("temp.h5", "temp", format="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")

shows that the fixed format is slightly faster to write (5.9 s vs 6.8 s for table on my machine).

Then reading the data (after a short pause to make sure the files have been fully written):

%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("ABC"))

Yields:

Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)

I understand that the fixed format is faster at reading the data, but why is reading the specified columns slower than reading the full dataframe? What is the benefit of the table format (with or without specified columns) over the fixed format?

Is there maybe a memory advantage when the df is growing even bigger?

user6538642

1 Answer


IMO the main advantage of using format='table' together with data_columns=[list_of_indexed_columns] is the ability to read huge HDF5 files conditionally (see the where="where clause" parameter), so that you can filter the data while reading and process it in chunks to avoid a MemoryError.
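
A minimal sketch of what that looks like (the file name 'indexed.h5' and the data are made up for illustration; this assumes PyTables is installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=list('ABC'))

# data_columns=['A'] makes column A queryable on disk
df.to_hdf('indexed.h5', key='df', format='table', data_columns=['A'], mode='w')

# only rows satisfying the condition are loaded into memory
subset = pd.read_hdf('indexed.h5', 'df', where='A > 0.5')

# alternatively, iterate in chunks to keep peak memory bounded
for chunk in pd.read_hdf('indexed.h5', 'df', chunksize=10_000):
    pass  # process each chunk here
```

Neither of these is possible with format='fixed', which always reads the whole dataset.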

You can also try saving single columns or column groups (those that will usually be read together) in separate HDF files, or in the same file under different keys.
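
For example, a sketch of the same-file-different-keys approach (the file name 'groups.h5' and the keys 'abc'/'rest' are made-up names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 9), columns=list('ABCDEFGHI'))

# store column groups that are usually read together under separate keys
df[list('ABC')].to_hdf('groups.h5', key='abc', format='table', mode='w')
df[list('DEFGHI')].to_hdf('groups.h5', key='rest', format='table', mode='a')

# reading one key touches only that group's data on disk
abc = pd.read_hdf('groups.h5', 'abc')
```

This avoids the per-row column extraction that makes columns=... on a wide table slow.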

I'd also consider using the "cutting-edge" Feather format:

Tests and timing:

import numpy as np
import pandas as pd
import feather

writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):

df = pd.DataFrame(np.random.randn(10000000,9),columns=list('ABCDEFGHI'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')
feather.write_dataframe(df, 'c:/temp/df.feather')

reading from disk:

In [122]: %timeit pd.read_hdf(r'C:\Temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'C:\Temp\tab.h5', "temp", columns=list('BDF'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('BDF'))
1 loop, best of 3: 218 ms per loop  # WINNER !!!

PS if you encounter the following error when using feather.write_dataframe(...):

FeatherError: Invalid: no support for strided data yet 

here is a workaround:

df = df.copy()  # copy() returns a DataFrame backed by contiguous (non-strided) arrays

after that feather.write_dataframe(df, path) should work properly...

MaxU - stand with Ukraine