
I have the following pandas DataFrame:

import pandas as pd
df = pd.read_csv('filename.csv')

Now, I can use HDFStore to write the df object to file (like adding key-value pairs to a Python dictionary):

store = pd.HDFStore('store.h5')
store['df'] = df

http://pandas.pydata.org/pandas-docs/stable/io.html

When I look at the contents, this object is a frame.

store 

outputs

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[552,23252])

However, in order to use indexing, one should store this as a table object.

My approach was to try HDFStore.put(), i.e.

HDFStore.put(key="store.h", value=df, format=Table)

However, this fails with the error:

TypeError: put() missing 1 required positional argument: 'self'

How does one save pandas DataFrames as PyTables tables?


2 Answers


Common part: create or open an existing HDFStore file:

store = pd.HDFStore('store.h5')

Try this if you want all columns indexed:

store.append('key_name', df, data_columns=True)

or this if you want just a subset of columns indexed:

store.append('key_name', df, data_columns=['colA','colC','colN'])

PS: HDFStore.append() saves DataFrames in table format by default.
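
As a minimal round-trip sketch (the key 'key_name' and the numeric column colA are illustrative, not taken from the question's data), a declared data column can then be queried directly with a where clause:

store = pd.HDFStore('store.h5')
store.append('key_name', df, data_columns=['colA'])
# colA was declared as a data column, so it can appear in a where clause
subset = store.select('key_name', where='colA > 100')
store.close()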

MaxU - stand with Ukraine
  • Thanks. How would you store this as a CArray or EArray? – JianguoHisiang Jul 19 '16 at 15:30
  • @JianguoHisiang, I don't have experience with CArray / EArray, so I can't answer your question in a comment. But what is wrong with HDFStore? Did my answer help? – MaxU - stand with Ukraine Jul 19 '16 at 15:47
  • It's correct, yes. However, I am unsure how to query this 'frame_table' object (it isn't a 'table' object). Normally, you would try something like `result = [row for row in table.where('col1 > 100 & col1 < 200')]` if `col1` is indexed. – JianguoHisiang Jul 19 '16 at 15:54
  • @JianguoHisiang, you can simply do: `df = store.select(..., where='...')` or `df = pd.read_hdf(..., where='...')` – MaxU - stand with Ukraine Jul 19 '16 at 16:05
  • Is this as efficient as using compression with a PyTables table? http://www.pytables.org/usersguide/optimization.html I'm trying to do these queries with a `frame_table` the size of 800 GB, so it needs to be as efficient as possible. The only way to import this data was with a pandas dataframe, so I'm feeling a bit lost. – JianguoHisiang Jul 19 '16 at 16:13
  • With the example above, `df = store.select(..., where='...')` would be `df = store.select('key_name', where='colA==chr1 & colC==0')` – JianguoHisiang Jul 19 '16 at 16:17
  • Thanks for the help! – JianguoHisiang Jul 19 '16 at 16:48
  • Test different methods and compare the speed. [Here are results of my tests](http://stackoverflow.com/a/37012035/5741205)... I would suggest you open another question and post sample data there with the same amount of columns, same dtypes, similar shape, etc., and specify what operations you want to speed up - reads / writes / searches (with examples of those searches) / etc., so we can give you a precise answer. – MaxU - stand with Ukraine Jul 19 '16 at 16:50
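
On the compression question raised in the comments: HDFStore accepts complevel and complib arguments, so the table can be compressed at write time. A minimal sketch (blosc is one commonly available compressor; the key name is illustrative):

store = pd.HDFStore('store.h5', complevel=9, complib='blosc')
store.append('key_name', df, data_columns=True)  # rows are compressed as they are written
store.close()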

How does one save pandas DataFrames as PyTables tables?

Adding to the accepted answer: you should always close the PyTables file. For convenience, pandas provides HDFStore as a context manager. (Incidentally, the TypeError in the question occurs because put() was called on the HDFStore class itself rather than on a store instance.)

with pd.HDFStore('/path/to/data.hdf') as hdf:
    hdf.put(key="store.h", value=df, format='table', data_columns=True)
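
The stored table can then be read back with a query, e.g. via pd.read_hdf (a sketch reusing the file path and key from the snippet above; the column colA in the where expression is hypothetical):

subset = pd.read_hdf('/path/to/data.hdf', key='store.h', where='colA > 0')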
miraculixx