I have a ~2 GB dataframe that is written once and read many times.
I want to work with it in pandas, so I have been using pd.read_hdf
and df.to_hdf
with the fixed format, which works well for both reading and writing.
However, the df keeps growing as more columns are added, so I would like to switch to the table format instead, which lets me select only the columns I need when reading the data. I expected this to give a speed advantage, but from my testing this doesn't seem to be the case.
This example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('ABCDEFGHI'))
%time df.to_hdf("temp.h5", "temp", format="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")
shows that writing the fixed format is slightly faster (5.9 s vs 6.8 s for the table format on my machine).
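For anyone running this outside IPython (no %time magic), the same comparison can be done with time.perf_counter; this is a sketch using a smaller frame so it finishes quickly, and the absolute timings are illustrative only:

```python
import time

import numpy as np
import pandas as pd

# Smaller frame than the original 10M-row example, purely to keep the run short.
df = pd.DataFrame(np.random.randn(100_000, 9), columns=list("ABCDEFGHI"))

t0 = time.perf_counter()
df.to_hdf("temp.h5", key="temp", format="fixed", mode="w")
t_fixed = time.perf_counter() - t0

t0 = time.perf_counter()
df.to_hdf("temp2.h5", key="temp2", format="table", mode="w")
t_table = time.perf_counter() - t0

print(f"fixed write: {t_fixed:.3f}s, table write: {t_table:.3f}s")
```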
Then reading the data (after a short pause to make sure the files have been fully written):
%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("ABC"))
yields:
Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)
I understand that the fixed format is faster at reading the data, but why is reading a subset of columns slower than reading the full dataframe? What is the benefit of the table format (with or without column selection) over the fixed format?
Is there perhaps a memory advantage as the df grows even bigger?
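For reference, the capability I understand only the table format provides is row-level querying on disk; this sketch (reusing the column names from the example above, with a smaller frame) passes data_columns=True so columns can be used in where= expressions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 9), columns=list("ABCDEFGHI"))

# data_columns=True stores every column as a queryable on-disk column.
df.to_hdf("temp2.h5", key="temp2", format="table", data_columns=True, mode="w")

# Read back only matching rows (and selected columns) without loading the
# full dataframe -- not possible with format="fixed".
subset = pd.read_hdf("temp2.h5", "temp2", where="A > 0", columns=list("ABC"))
print(subset.shape)
```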