24

Pandas has the following examples for how to store Series, DataFrames and Panels in HDF5 files:

Prepare some data:

In [1142]: store = HDFStore('store.h5')

In [1143]: index = date_range('1/1/2000', periods=8)

In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [1145]: df = DataFrame(randn(8, 3), index=index,
   ......:                columns=['A', 'B', 'C'])
   ......:

In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ......:            major_axis=date_range('1/1/2000', periods=5),
   ......:            minor_axis=['A', 'B', 'C', 'D'])
   ......:

Save it in a store:

In [1147]: store['s'] = s

In [1148]: store['df'] = df

In [1149]: store['wp'] = wp

Inspect what's in the store:

In [1150]: store
Out[1150]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[8,3])  
/s             series       (shape->[5])    
/wp            wide         (shape->[2,5,4])

Close the store:

In [1151]: store.close()

Questions:

  1. In the code above, when is the data actually written to disk?

  2. Say I want to add thousands of large dataframes living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one, since I cannot afford to hold them all in memory at once. Is this possible with HDF5? What would be the correct way to do it?

  3. The Pandas documentation says the following:

    "These stores are not appendable once written (though you simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."

    What is meant by not appendable and not queryable? Also, shouldn't it say once closed instead of once written?

Amelio Vazquez-Reina

2 Answers

15
  1. As soon as the statement is executed, e.g. store['df'] = df. The close just closes the actual file (which will be closed for you if the process exits, but a warning message will be printed)

  2. Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format

    It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append and create a smaller number of nodes.

    You can just iterate through your .csv files and store/append them one by one. Something like:

    for f in files:
        df = pd.read_csv(f)
        df.to_hdf('file.h5', f)  # the file name f becomes the node key
    

    would be one way of doing it (creating a separate node for each file).

  3. Not appendable - once you write it, you can only retrieve it in its entirety; e.g. you cannot select a sub-section

    If you have a table, then you can do things like:

    pd.read_hdf('my_store.h5', 'a_table_node', where=['index>100'])
    

    which is like a database query, only getting part of the data

    Thus, a plain store is neither appendable nor queryable, while a table is both (a short sketch contrasting the two follows below).
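To make the distinction concrete, here is a minimal sketch contrasting the two storage modes; the file name demo.h5 and the node keys are made up for illustration, and it assumes pandas with the PyTables dependency installed:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'])

    store = pd.HDFStore('demo.h5')

    # Plain store: written to disk as soon as the assignment executes,
    # but neither appendable nor queryable afterwards.
    store['plain_df'] = df

    # Table: appendable and queryable.
    store.append('table_df', df)   # creates the table node
    store.append('table_df', df)   # appends more rows to the same node

    # Retrieve only part of the table, like a database query:
    subset = store.select('table_df', where='index >= 4')

    store.close()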

Jeff
  • Thanks, I am a bit confused about your answer to 2). Why do you pass `f` to `df.to_hdf`? Also, is there any difference between `df.to_hdf` and `store['df'] = df`? Finally, I am confused about how nodes relate to the code above. Does `store['df'] = df` necessarily create a new node in the file? Is there a way to add multiple DataFrames to a single HDF5 file within a single node? – Amelio Vazquez-Reina May 19 '13 at 17:48
  • 2
    the node name is `f`; `store[f] = pd.read_csv(f)` is equivalent to `df.to_hdf`, but `df.to_hdf` auto opens/closes the store for you. `store['df'] = df` creates/overwrites the node named `'df'`. FYI it's not a good idea to keep doing this; see the section on deleting data. A node can only hold one object (e.g. a frame), but you can create a hierarchy of nodes if you want (e.g. `node_a/df`, `node_a/sub_node/df` effectively holds multiple frames under a single node; see the sketch after these comments) – Jeff May 19 '13 at 18:14
  • Thanks so much Jeff. By the way, what imports/packages do I need to use `HDFStore()`, append tables, and use `read/write_hdf` in Pandas? – Amelio Vazquez-Reina May 19 '13 at 18:25
  • 1
    http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies, numexpr & PyTables – Jeff May 19 '13 at 18:26
  • Fantastic. Thanks Jeff. Would you mind expanding a bit on why having too many nodes in an `.h5` is a bad idea? (e.g. would 500K be too many?). Besides appending dfs in tables, is there any other way to group 100Ks of dataframes in a single `.h5` file without creating too many nodes? – Amelio Vazquez-Reina May 19 '13 at 19:38
  • access times can degrade with that many nodes; you are much better off with fewer nodes holding much bigger tables. How big is each csv (rows/cols), roughly? Are they related? What is your access pattern for retrieving data? – Jeff May 19 '13 at 19:57
  • 2
    here's a link to some recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore – Jeff May 19 '13 at 20:14
  • Very helpful Jeff. Thanks! By the way, I am running into some odd problems when following this route when saving a single DataFrame: http://stackoverflow.com/questions/16639503/unable-to-save-dataframe-to-hdf5-object-header-message-is-too-large – Amelio Vazquez-Reina May 19 '13 at 21:16
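A minimal sketch of the node hierarchy described in the comments above; the key names node_a/df and node_a/sub_node/df come from Jeff's comment, while the file and column names are made up:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(4, 2), columns=['x', 'y'])

    with pd.HDFStore('hierarchy.h5') as store:
        # Keys containing '/' create a hierarchy of nodes.
        store['node_a/df'] = df
        store['node_a/sub_node/df'] = df

        # Both frames live under node_a, each in its own leaf node.
        print(store.keys())  # ['/node_a/df', '/node_a/sub_node/df']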
4

Answering question 2, with pandas 0.18.0 you can do:

import pandas as pd

store = pd.HDFStore('compiled_measurements.h5')
for filepath in file_iterator:
    raw = pd.read_csv(filepath)
    # data_columns makes 'a', 'b' and 'c' individually indexable;
    # index=False defers index creation until all appends are done.
    store.append('measurements', raw, data_columns=['a', 'b', 'c'], index=False)

store.create_table_index('measurements', columns=['a', 'b', 'c'], optlevel=9, kind='full')
store.close()

Based on this part of the docs.

Depending on how much data you have, the index creation can consume enormous amounts of memory. The PyTables docs describe the possible values of optlevel.
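Once the store is built, the indexed columns can be queried without loading the whole file; a small usage sketch, assuming the same made-up column names as above:

import pandas as pd

# Pull only the rows matching the condition; the index created above
# makes queries on 'a', 'b' and 'c' efficient.
subset = pd.read_hdf('compiled_measurements.h5', 'measurements', where='a > 0')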

Pablo