I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and some integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating into 30-minute bins). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can iterate on a single step without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore. Concretely, the user should get a list of all columns of all DataFrames stored in the HDF file, then select which columns they would like to see, after which I use matplotlib to present them the corresponding data.
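The plotting workflow I have in mind looks roughly like this (a sketch with made-up file, key and column names; obtaining the full list of available columns cheaply is exactly the part I'm missing):

    import matplotlib.pyplot as plt
    import pandas as pd

    # hypothetical: the user picked these from the full list of column names
    # (getting that full list without loading the data is what I'm asking about below)
    selected = ['temperature', 'pressure']

    with pd.HDFStore('intermediate.h5') as hdf:
        df = hdf.get('/step1')           # currently this loads the entire frame

    df[selected].plot(subplots=True)     # one subplot per selected column
    plt.show()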
Datashape == (5730000, 339)
does not seem large at all, which is why I'm confused... (The number of rows might grow considerably over time, but the columns should stay fixed.)
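For scale: assuming 8-byte values (float64/int64) throughout, that shape corresponds to roughly 15 GB in memory, ignoring the index and any overhead:

    rows, cols = 5_730_000, 339
    print(rows * cols * 8 / 1e9)    # ~15.5 GB of raw values at 8 bytes each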
In the first step I iteratively append rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
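The build-up step is roughly like this (a simplified sketch; in reality the blocks come from measurements, not random numbers, and the frame grows to the shape above):

    import numpy as np
    import pandas as pd

    blocks = []
    for day in pd.date_range('2020-01-01', periods=3, freq='D'):
        # one block of new rows per iteration, indexed by timestamp
        idx = pd.date_range(day, periods=8640, freq='10s')
        blocks.append(pd.DataFrame(np.random.randn(len(idx), 4), index=idx,
                                   columns=['temperature', 'pressure', 'flow', 'level']))

    df = pd.concat(blocks)        # rows are appended iteratively over time
    df['new_sensor'] = np.nan     # occasionally a new column is added as well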
My approach
- I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think).
- I use a datetime index and automatically selected float or integer columns.
- I save the steps with

    hdf.put('/name', df, format='fixed')

  since

    hdf.put('/name'.format(grp), df, format='table', data_columns=True)

  seemed to be far too slow.
- To process the data I use e.g.

    df.groupby(df.index).first()

  and

    df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict)

  where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
- For plotting, I have to read in the entire DataFrame and then get the columns (see the combined sketch below):

    hdfstore.get('/name').columns
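Put together, the steps above look roughly like this (file name, keys and agg_dict are placeholders; the real frame is of course much larger):

    import numpy as np
    import pandas as pd

    # small stand-in for the real ~5.7M x 339 frame
    idx = pd.date_range('2020-01-01', periods=10_000, freq='10s')
    df = pd.DataFrame(np.random.randn(len(idx), 4), index=idx,
                      columns=['temperature', 'pressure', 'flow', 'level'])
    agg_dict = {col: 'mean' for col in df.columns}    # in reality one function per column

    # save an intermediate step ('fixed', because 'table' + data_columns was far slower)
    with pd.HDFStore('intermediate.h5') as hdf:
        hdf.put('/step1', df, format='fixed')

    # processing: drop duplicate timestamps, then aggregate into 30-minute bins
    df = df.groupby(df.index).first()
    df_30min = df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict)

    # for plotting I currently have to load the whole frame just to get the column names
    with pd.HDFStore('intermediate.h5') as hdf:
        cols = hdf.get('/step1').columns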
Question
- How can I retrieve all columns without reading any data from the HDFStore?
- What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
- Does it matter in terms of efficiency whether the index is a datetime index? Is there a more efficient format in general (e.g. all columns of the same, fixed dtype)?
- Is there a faster way to aggregate than groupby, i.e. df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict)?
Similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns, but only after I already know the column names, I think.
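If I understand it correctly, that boils down to something like the following, which requires the data to be stored in 'table' format and the column names to be known in advance (names below are placeholders):

    import pandas as pd

    with pd.HDFStore('intermediate.h5') as hdf:
        # works only for format='table' stores and reads just the requested columns,
        # but I need to know the column names beforehand
        sub = hdf.select('/step1_table', columns=['temperature', 'pressure'])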
Thank you for any advice!