
I have a multidimensional pandas dataframe created like this:

import numpy as np
import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)
store = pd.HDFStore("df.h5")
store["df"] = df
store.close()

I would like to add attributes to df stored in the HDFStore. How can I do this? There doesn't seem to be any documentation regarding the attributes, and the group that is used to store the df is not of the same type as the HDF5 Group in the h5py module:

type(list(store.groups())[0])
Out[24]: tables.group.Group

It seems to be a PyTables group, whose only attribute-related method I could find is this special method, which concerns a different kind of attribute (Python attributes):

__setattr__(self, name, value)
 |      Set a Python attribute called name with the given value.

What I would like is to simply store a bunch of DataFrames with multidimensional indices that are "marked" by attributes in a structured way, so that I can compare them and sub-select them based on those attributes.

Basically what HDF5 is meant to be used for + multidim DataFrames from pandas.

There are questions like this one, that deal with reading HDF5 files with other readers than pandas, but they all have DataFrames with one-dim indices, which makes it easy to simply dump numpy ndarrays, and store the index additionally.

tmaric
  • Regardless of what those other questions demonstrate, you should be able to explore your own `table` output with `h5dump` or `h5py`. My memory is that the pd tables layout is quite complex. – hpaulj Aug 23 '18 at 17:34
  • @hpaulj: Well, I am reading about h5py and there it is quite easy to add attributes to datasets and groups, and it seems that pandas doesn't support this. Ordinary tables are easy to store, but I still haven't figured out what to do with the multidim dframes. I don't know how to add an attribute to the group that pandas creates for a df stored with HDFStore. – tmaric Aug 23 '18 at 19:48
  • Maybe you can use h5py to add attributes after the fact. – hpaulj Aug 23 '18 at 20:02
  • @hpaulj: yes, that's what I am trying out now, use pandas.HDFStore to drop the multidim dframes into HDF5, then read the file with h5py and add attributes to the dframe groups. – tmaric Aug 23 '18 at 20:12

2 Answers


I haven't gotten any answers so far, and this is what I managed to do using both the pandas and the h5py modules: pandas is used to store and read the multidimensional DataFrame, and h5py to store and read the attributes of the HDF5 group:

import numpy as np
import pandas as pd
import h5py

# Create a random multidim DataFrame
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)

# Dump the data with pandas; the store is closed before h5py opens the file
with pd.HDFStore("df.h5") as pdStore:
    pdStore["df"] = df

# Store the attribute on the group with h5py.
# Mode "a" is needed for writing: h5py 3.x opens files read-only by default.
with h5py.File("df.h5", "a") as h5pyFile:
    h5pyFile["/df"].attrs["number"] = 1

# Read the data conditionally based on the stored attribute
with h5py.File("df.h5", "r") as h5pyFile:
    number = h5pyFile["/df"].attrs["number"]

readDf = pd.DataFrame()
if number == 1:
    with pd.HDFStore("df.h5", mode="r") as pdStore:
        readDf = pdStore["/df"]

print(readDf - df)

Each library opens and closes the file in turn here, because I still don't know whether there are any issues in having both h5py and pandas handle the .h5 file simultaneously.
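
To get closer to the "sub-select based on attributes" goal from the question, the same idea can be wrapped in two small helpers. This is only a sketch under some assumptions: the helper names `tag_groups` and `keys_matching`, the keys `run_a`/`run_b`/`run_c` and the temporary file name are made up for illustration, the DataFrames are stored in pandas' default fixed format at the top level of the file, and pandas and h5py never have the file open at the same time.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def tag_groups(path, tags):
    """Write HDF5 attributes; `tags` maps a group key to a {name: value} dict."""
    with h5py.File(path, "a") as f:
        for key, attrs in tags.items():
            for name, value in attrs.items():
                f[key].attrs[name] = value


def keys_matching(path, name, value):
    """Return the top-level group names whose attribute `name` equals `value`."""
    with h5py.File(path, "r") as f:
        return sorted(k for k in f.keys() if f[k].attrs.get(name) == value)


# Same MultiIndex layout as in the question
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
mindex = pd.MultiIndex.from_product(iterables, names=["first", "second"])
frames = {
    key: pd.DataFrame(np.random.randn(8, 4), index=mindex)
    for key in ("run_a", "run_b", "run_c")  # hypothetical keys
}

path = os.path.join(tempfile.mkdtemp(), "tagged.h5")  # hypothetical file

# 1. pandas writes the DataFrames and closes the file
with pd.HDFStore(path, mode="w") as store:
    for key, frame in frames.items():
        store[key] = frame

# 2. h5py tags the groups that pandas created
tag_groups(path, {"run_a": {"number": 1},
                  "run_b": {"number": 2},
                  "run_c": {"number": 1}})

# 3. h5py picks the matching keys, pandas loads only those DataFrames
selected = {key: pd.read_hdf(path, key)
            for key in keys_matching(path, "number", 1)}
print(sorted(selected))
```

Only scalar attribute values (numbers, strings) are compared here; array-valued attributes would need a different comparison.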

tmaric
  • Regarding using `h5py` and `pandas` at the same time, I think that could potentially lead to problems. `pandas` uses `PyTables` under the hood, and they don't support concurrency, according to their FAQ. https://www.pytables.org/FAQ.html#can-pytables-be-used-in-concurrent-access-scenarios So I guess it is good practice to close the `HDFStore` before opening the `h5py.File` – LudvigH Feb 25 '20 at 09:07
  • Any more progress on this at all? I'm about to run into the wall on this full speed, so I'd like to learn from the best :) – J.Hirsch Dec 08 '20 at 14:45
  • Not really, I gave up on HDF5 because the C API is too cumbersome and the MPI support in h5py was not stable at that time. I ended up using pandas.DataFrame with MultiIndex and packing all the metadata into column data. This makes each row uniquely identifiable, and it is easy to use MultiIndex to slice the DataFrame. Good luck! Post an answer here if you figure out how to use HDF5 :) – tmaric Dec 09 '20 at 15:12

Adding attributes to a group from within pandas seems to be possible by now (I could not find out since which release; I tested the snippet below with pandas 1.4.2 and Python 3.10.4). According to pandas' HDF cookbook, the following approach can be used:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3))
store = pd.HDFStore("test.h5")
store.put("df", df)
store.get_storer("df").attrs.my_attribute = {"A": 10}
store.close()

HDFStore can also be used as a context manager:

with pd.HDFStore("test.h5") as store:
    store.put("df", df)
    store.get_storer("df").attrs.my_attribute = {"A": 10}

Note that the attribute name can be chosen freely (data_origin in the following), and the value does not have to be a dictionary:

store.get_storer("df").attrs.data_origin = 'random data generation'
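
Reading the attribute back works the same way, so several DataFrames can be filtered by their attributes without leaving pandas. The following is a minimal sketch, not a tested recipe for every pandas version: the attribute name `tags`, the keys and the metadata values are invented for illustration, and the MultiIndex is the one from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
mindex = pd.MultiIndex.from_product(iterables, names=["first", "second"])

path = os.path.join(tempfile.mkdtemp(), "tagged_pandas.h5")  # hypothetical file

# Hypothetical metadata, one dict per stored DataFrame
metadata = {
    "run_a": {"number": 1, "origin": "random data generation"},
    "run_b": {"number": 2, "origin": "random data generation"},
    "run_c": {"number": 1, "origin": "random data generation"},
}

# Write each DataFrame and attach its metadata to the storer's attributes
with pd.HDFStore(path, mode="w") as store:
    for key, tags in metadata.items():
        store.put(key, pd.DataFrame(np.random.randn(8, 4), index=mindex))
        store.get_storer(key).attrs.tags = tags

# Re-open read-only and load only the DataFrames whose tags match
with pd.HDFStore(path, mode="r") as store:
    selected = {
        key: store[key]
        for key in store.keys()  # keys come back with a leading "/"
        if store.get_storer(key).attrs.tags["number"] == 1
    }

print(sorted(selected))
```

The filter only touches the attributes, so the DataFrames that do not match are never read from disk.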
albert