132

I am trying to read data from an HDF5 file in Python. I can open the file using h5py, but I cannot figure out how to access the data within the file.

My code

import h5py
import numpy as np
f1 = h5py.File(file_name, 'r+')

This works and the file is read. But how can I access data inside the file object f1?

Martin Thoma
Sameer Damir
    If the file holds a Keras model, you will probably want to [load it with Keras](https://stackoverflow.com/questions/35074549/how-to-load-a-model-from-an-hdf5-file-in-keras) instead. – Josiah Yoder Jun 20 '18 at 19:12
    Is an `hdf5` file different from an `hdf` file? I have `hdf`s (they are several bands of images), but I cannot figure out how to open them. – mikey Aug 11 '20 at 14:42
  • df = numpy.read_hdf(fileName.hdf5) -> this stores the data into a numpy dataframe that you can use. – Tanmoy Oct 08 '21 at 13:00

12 Answers

206

Read HDF5

import h5py
filename = "file.hdf5"

with h5py.File(filename, "r") as f:
    # Print all root level object names (aka keys) 
    # these can be group or dataset names 
    print("Keys: %s" % f.keys())
    # get first object name/key; may or may NOT be a group
    a_group_key = list(f.keys())[0]

    # get the object type for a_group_key: usually group or dataset
    print(type(f[a_group_key])) 

    # If a_group_key is a group name, 
    # this gets the object names in the group and returns as a list
    data = list(f[a_group_key])

    # If a_group_key is a dataset name, 
    # this gets the dataset values and returns as a list
    data = list(f[a_group_key])
    # preferred methods to get dataset values:
    ds_obj = f[a_group_key]      # returns as a h5py dataset object
    ds_arr = f[a_group_key][()]  # returns as a numpy array
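If you just want to see everything in the file at once, h5py's visititems walks the whole hierarchy; a minimal sketch, assuming the same filename as above:

import h5py
filename = "file.hdf5"

with h5py.File(filename, "r") as f:
    # visititems calls the supplied function with (path, object) for
    # every group and dataset below the root
    f.visititems(lambda name, obj: print(name, type(obj)))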

Write HDF5

import h5py

# Create random data
import numpy as np
data_matrix = np.random.uniform(-1, 1, size=(10, 3))

# Write data to HDF5
with h5py.File("file.hdf5", "w") as data_file:
    data_file.create_dataset("dataset_name", data=data_matrix)

See h5py docs for more information.

Alternatives

For your application, the following might be important:

  • Support by other programming languages
  • Reading / writing performance
  • Compactness (file size)

See also: Comparison of data serialization formats

If you are instead looking for a way to manage configuration files, you might want to read my short article Configuration files in Python

kcw78
Martin Thoma
    To get the data in the HDF5 datasets as a numpy array, you can do `f[key].value` – erickrf May 04 '17 at 20:49
    As of `h5py` version 2.1: "The property `Dataset.value`, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using `.value` should be updated to use NumPy indexing, using `mydataset[...]` or `mydataset[()]` as appropriate." – honey_badger Feb 19 '20 at 14:31
  • I am using Julia's hdf5 library and the read operation is much faster (would include it as answer, but OP asked for python). The same hdf5 file read takes forever in h5py, however it is very manageable in Julia, worth learning to program in Julia just for this one problem. The only issue I had with Julia was that it didn't handle null terminated strings correctly, which for me was a bit of a roadblock. – demongolem Mar 20 '20 at 10:53
  • Commenting on the answer itself, the list operation in the read version causes python to freeze. If I just do f[a_group_key] it works at the proper speed. – demongolem Mar 20 '20 at 10:59
  • @demongolem: you should not use the listing of all keys if you already know which one you want to use. I have done it here to have a self-contained example that requires the least amount of work to get something running. – Martin Thoma Mar 20 '20 at 11:08
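A small sketch of the .value migration mentioned in the comments above, reusing the dataset_name dataset from the write example:

import h5py

with h5py.File("file.hdf5", "r") as f:
    ds = f["dataset_name"]
    # old API (removed in h5py 3.x): arr = ds.value
    arr = ds[()]    # entire dataset as a numpy array
    part = ds[0:5]  # NumPy-style slicing reads only part of the dataset
    print(arr.shape, part.shape)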
38

Reading the file

import h5py

f = h5py.File(file_name, mode)  # mode can be 'r' (read-only), 'r+' (read/write), 'w' (create/truncate), 'a' (read/write, create if needed)

Studying the structure of the file by printing what HDF5 groups are present

for key in f.keys():
    print(key)  # names of the root-level objects in the HDF5 file - can be groups or datasets
    print(type(f[key])) # get the object type: usually group or dataset

Extracting the data

#Get the HDF5 group; key needs to be a group name from above
group = f[key]

#Checkout what keys are inside that group.
for key in group.keys():
    print(key)

# This assumes group[some_key_inside_the_group] is a dataset, 
# and returns a np.array:
data = group[some_key_inside_the_group][()]
#Do whatever you want with data

#After you are done
f.close()
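The same flow also works with a context manager, which closes the file automatically; a minimal sketch, assuming the same file_name and a read-only open:

import h5py

with h5py.File(file_name, 'r') as f:
    for key in f.keys():
        print(key, type(f[key]))  # the file is closed automatically on exit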
kcw78
Daksh
28

You can use Pandas:

import pandas as pd
pd.read_hdf(filename, key)
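A minimal round-trip sketch; it assumes the PyTables package is installed (required by pandas' HDF functions) and uses a made-up key "df":

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df.to_hdf("pandas_file.h5", key="df", mode="w")    # write
df_back = pd.read_hdf("pandas_file.h5", key="df")  # read back
print(df_back)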
Danny
    You should not rely on the Pandas implementation unless you are storing dataframes. read_hdf relies on the HDF file to be in a certain structure; also there is no pd.write_hdf, so you could only use it one-way. See [this post](https://stackoverflow.com/questions/33641246/pandas-cant-read-hdf5-file-created-with-h5py/33644128#33644128). – Max Jan 26 '19 at 21:20
    Pandas does have a writing function. See [pd.DataFrame.to_hdf](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html) – Eric Taw Mar 17 '19 at 18:58
9

Here's a simple function I just wrote which reads a .hdf5 file generated by the save_weights function in keras and returns a dict with layer names and weights:

import h5py

def read_hdf5(path):

    weights = {}

    keys = []
    with h5py.File(path, 'r') as f: # open file
        f.visit(keys.append) # append all object names to the list
        for key in keys:
            if ':' in key: # weight datasets contain ':' in their names
                print(f[key].name)
                weights[f[key].name] = f[key][()]  # f[key].value was removed in h5py 3.x
    return weights

https://gist.github.com/Attila94/fb917e03b04035f3737cc8860d9e9f9b.

Haven't tested it thoroughly but does the job for me.

Attila
7

To read the content of a .hdf5 file into an array, you can do something as follows. Note that np.fromfile reads the file's raw bytes (including the HDF5 header) rather than parsing the HDF5 structure, so use h5py as in the other answers if you need specific datasets:

import numpy as np
myarray = np.fromfile('file.hdf5', dtype=float)
print(myarray)
Raza
5

Use the code below to read the data and convert it into a numpy array (note: `.value` only works with h5py < 3.0):

import h5py
import numpy as np

f1 = h5py.File('data_1.h5', 'r')
list(f1.keys())
X1 = f1['x']
y1 = f1['y']
df1 = np.array(X1.value)   # .value was removed in h5py 3.x
dfy1 = np.array(y1.value)
print(df1.shape)
print(dfy1.shape)

Preferred method to read dataset values into a numpy array:

import h5py
# use Python file context manager:
with h5py.File('data_1.h5', 'r') as f1:
    print(list(f1.keys()))  # print list of root level objects
    # following assumes 'x' and 'y' are dataset objects
    ds_x1 = f1['x']  # returns h5py dataset object for 'x'
    ds_y1 = f1['y']  # returns h5py dataset object for 'y'
    arr_x1 = f1['x'][()]  # returns np.array for 'x'
    arr_y1 = f1['y'][()]  # returns np.array for 'y'
    arr_x1 = ds_x1[()]  # uses dataset object to get np.array for 'x'
    arr_y1 = ds_y1[()]  # uses dataset object to get np.array for 'y'
    print (arr_x1.shape)
    print (arr_y1.shape)
kcw78
ashish bansal
2
If the .h5 file contains a complete Keras model (saved with model.save), you can load it directly with Keras:

from keras.models import load_model

h = load_model('FILE_NAME.h5')
taras
Judice
2

If you have named datasets in the HDF5 file, you can use the following code to read those datasets and convert them into numpy arrays:

import h5py
import numpy as np

file = h5py.File('filename.h5', 'r')

xdata = file.get('xdata')
xdata = np.array(xdata)

If your file is in a different directory you can add the path in front of 'filename.h5'.

Machzx
0

What you need to do is create a dataset. If you take a look at the quickstart guide, it shows that you need to use the file object to create a dataset: call f.create_dataset, and then you can read the data back. This is explained in the docs.
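A minimal sketch of that create-then-read flow, using made-up file and dataset names:

import h5py
import numpy as np

# write: create a dataset in a new file
with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("mydataset", data=np.arange(10))

# read the values back
with h5py.File("example.hdf5", "r") as f:
    print(f["mydataset"][()])  # values come back as a numpy array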

Games Brainiac
0

Using bits of answers from this question and the latest doc, I was able to extract my numerical arrays using

import h5py
with h5py.File(filename, 'r') as h5f:
    h5x = h5f[list(h5f.keys())[0]]['x'][()]

Where 'x' is simply the X coordinate in my case.

Patol75
0

Use this; it works fine for me:


import h5py

def read_hdf5():
    weights = {}
    keys = []
    with h5py.File("path.h5", 'r') as f:
        f.visit(keys.append)  # collect the names of all objects in the file
        for key in keys:
            if ':' in key:  # weight datasets contain ':' in their names
                print(f[key].name)
                weights[f[key].name] = f[key][()]
    return weights

print(read_hdf5())

If you are using h5py <= 2.9.0, then you can use:


import h5py

def read_hdf5():
    weights = {}
    keys = []
    with h5py.File("path.h5", 'r') as f:
        f.visit(keys.append)  # collect the names of all objects in the file
        for key in keys:
            if ':' in key:  # weight datasets contain ':' in their names
                print(f[key].name)
                weights[f[key].name] = f[key].value  # .value works only in h5py <= 2.x
    return weights

print(read_hdf5())
0

I recommend a wrapper around h5py, H5Attr, that allows you to load HDF5 data easily via attributes such as group.dataset (equivalent to the original group['dataset']), with IPython/Jupyter tab completion.

The code is here. Here are some usage examples; you can try the code below yourself:

# create example HDF5 file for this guide
import h5py, io
file = io.BytesIO()
with h5py.File(file, 'w') as fp:
    fp['0'] = [1, 2]
    fp['a'] = [3, 4]
    fp['b/c'] = 5
    fp.attrs['d'] = 's'

# import package
from h5attr import H5Attr

# open file
f = H5Attr(file)

# easy access to members, with tab completion in IPython/Jupyter
f.a, f['a']

# also work for subgroups, but note that f['b/c'] is more efficient
# because it does not create f['b']
f.b.c, f['b'].c, f['b/c']

# access to HDF5 attrs via a H5Attr wrapper
f._attrs.d, f._attrs['d']

# show summary of the data
f._show()
# 0   int64 (2,)
# a   int64 (2,)
# b/  1 members

# lazy (default) and non-lazy mode
f = H5Attr(file)
f.a  # <HDF5 dataset "a": shape (2,), type "<i8">

f = H5Attr(file, lazy=False)
f.a  # array([3, 4])
Syrtis Major