1

After searching a lot, I couldn't find a simple way to extract data from an .h5 file and load it into a DataFrame with NumPy or pandas so that I can save it as a .txt or .csv file.

import h5py
import numpy as np
import pandas as pd

filename = r'D:\data.h5'  # raw string so the backslash is not read as an escape
f = h5py.File(filename, 'r')

# List all groups
print("Keys: %s" % f.keys())
a_group_key = list(f.keys())[0]

# Get the data
data = list(f[a_group_key])
pd.DataFrame(data).to_csv("hi.csv")

Output:

Keys: <KeysViewHDF5 ['dd48']>

When I print data, I see the following:

print(data)
['axis0',
 'axis1',
 'block0_items',
 'block0_values',
 'block1_items',
 'block1_values']

I would appreciate it if someone could explain what these are and how I can extract the data completely and save it to a .csv file. There doesn't seem to be a routine way to do this, and it is still quite challenging. So far I have only been able to see part of the data via:

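import numpy as np
import pandas as pd

# np.fromfile reads the file's raw bytes as floats; it does not parse the HDF5 structure
dfm = np.fromfile(r'D:\data.h5', dtype=float)
print(dfm.shape)
print(dfm[5:])

pd.DataFrame(dfm).to_csv('train.csv')
#dfm.to_csv('hi.csv', sep=',', header=None, index=None)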

I expect to extract the time_stamps and measurements stored in the .h5 file.

Mario
  • This question is related to Python, but if you wanted a generic way to extract text data and other info from an HDF file, you can check out the **HDFView** application. – Jeromy Adofo Dec 10 '22 at 16:16

2 Answers

0

It looks like that data was written by Pandas, so use pd.read_hdf() to read it.
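
As a minimal sketch of that suggestion (not tested against the asker's file): the helper name `hdf_key_to_csv` is hypothetical, the key `'dd48'` is taken from the output shown in the question, and it assumes the PyTables package (`tables`) is installed, since `pd.read_hdf()` depends on it.

```python
import pandas as pd


def hdf_key_to_csv(h5_path, key, csv_path):
    """Read the pandas object stored under `key` and write it as CSV.

    Hypothetical helper for illustration; pd.read_hdf needs PyTables.
    """
    df = pd.read_hdf(h5_path, key=key)
    df.to_csv(csv_path)
    return df
```

It would be called as `hdf_key_to_csv(r'D:\data.h5', 'dd48', 'data.csv')`. If the file holds a single pandas object, `key` can probably be left as `None`.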

John Zwinck
  • Hi, it doesn't work. Look: `reread = pd.read_hdf('D:\data.h5')` fails with `ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing`. I installed/upgraded `pytables` with `pip install --upgrade tables`, but now I get `ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject`. Do you have any ideas? – Mario May 21 '19 at 13:14
  • 1
    @Mario, you may need an updated or clean installation of `pandas` and/or `numpy`. If the `h5` was written with `pandas` and `pytables`, it will be a lot easier to read it with the same tools. `h5py` is a lower-level interface to the files that uses only `numpy` arrays. So it can read the file, but building a dataframe from the arrays will be more work and require more knowledge of pandas internals. – hpaulj May 21 '19 at 17:04
  • @hpaulj The `numpy` version that I use is `1.15.4` and the one from pip is `1.16.3`; the `Pandas` version that I use is `0.23.4` and the one from pip is `0.24.1`, as can be seen [here](https://i.imgur.com/qOxihLP.jpg). I can't update them because that would be **incompatible** with Keras and TF, but thanks for your consideration. – Mario May 21 '19 at 17:26
  • @Mario: You simply need to install PyTables, probably using the same package manager you used to get Pandas. – John Zwinck May 22 '19 at 04:18
  • Mario, You can read the HDF5 file with any of the Python modules mentioned above (Pandas, PyTables, or h5py). I don't use Pandas, so can't help there. Pytables can also extract datasets to numpy arrays. If you are new to HDF5, I suggest installing **HDFView** from **The HDF Group** to see the data and structure inside the file. IMHO, "seeing" your data is very helpful until you master the coding. Also, if you only need to get a few datasets one time, it has an export tool that will write a CSV file, and you can skip the coding. – kcw78 May 22 '19 at 05:07
  • @kcw78: "Seeing" this data will not be very helpful, as it was written in a specific Pandas format which only really makes sense to Pandas. – John Zwinck May 22 '19 at 11:04
  • @John Zwinck: I get it, similar to Matlab's unique schema. It's not so much the data as the data structure, specifically for new users. IMHO, the Groups/Dataset layout is easier to understand with a visual representation. – kcw78 May 22 '19 at 13:09
0

h5py accesses HDF5 datasets as NumPy arrays. Your call to get the keys returns a list of the dataset names. Now that you have them, it should be fairly simple to read each one as a NumPy array and write it out. You need each dataset's dtype to know what is in each column and format it correctly.

Updated 5/22/2019 to reflect content of data.h5 posted at link in comment. Default format in np.savetxt() is '%.18e'. Very simple (crude) logic provided to modify format based on dtype for these datasets. This requires more robust dtype checking and formatting for general use. Also, you will need to add logic to decode unicode strings.

import h5py
import numpy as np

filename = r'D:\data.h5'  # raw string so the backslash is not read as an escape

with h5py.File(filename, 'r') as h5f:
    # Get a list of the datasets in group 'dd48'
    group = h5f['dd48']
    a_dset_keys = list(group.keys())

    # Write each dataset to its own CSV file
    for dset in a_dset_keys:
        ds_data = group[dset]
        print('dataset=', dset)
        print(ds_data.dtype)
        # Crude format selection based on dtype
        if ds_data.dtype == 'float64':
            csvfmt = '%.18e'
        elif ds_data.dtype == 'int64':
            csvfmt = '%.10d'
        else:
            csvfmt = '%s'
        np.savetxt('output_' + dset + '.csv', ds_data, fmt=csvfmt, delimiter=',')
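
If a single table is wanted instead of one CSV per dataset, the arrays can be stitched back together by hand. The sketch below rests on an assumption about the layout pandas uses for its default "fixed" format: `axis0` holds all column labels, `axis1` holds the row index, each `blockN_items` holds the column labels of one block, and each `blockN_values` is a 2-D array shaped (rows, columns in that block). The function names are hypothetical, and only plain numeric blocks are handled.

```python
import numpy as np
import pandas as pd


def _decode(labels):
    """Turn HDF5 byte strings into str; leave other labels unchanged."""
    return [v.decode("utf-8") if isinstance(v, bytes) else v for v in labels]


def frame_from_fixed_layout(group):
    """Rebuild a DataFrame from axis*/block* arrays (assumed layout).

    `group` can be any mapping of name -> array-like, for example an
    h5py group such as h5f['dd48'] or a plain dict of NumPy arrays.
    """
    index = np.asarray(group["axis1"])
    pieces = []
    n = 0
    while f"block{n}_items" in group:
        items = _decode(np.asarray(group[f"block{n}_items"]).tolist())
        values = np.asarray(group[f"block{n}_values"])
        pieces.append(pd.DataFrame(values, index=index, columns=items))
        n += 1
    frame = pd.concat(pieces, axis=1)
    # Restore the original column order recorded in axis0
    column_order = _decode(np.asarray(group["axis0"]).tolist())
    return frame[column_order]
```

With the file open, `frame_from_fixed_layout(h5f['dd48']).to_csv('data.csv')` would then write one CSV. If `axis1` holds timestamps, they may come back as integers that still need `pd.to_datetime()`.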
kcw78
  • Thanks for your reply. Given this [data.h5](https://drive.google.com/file/d/1WaBMYH9Ts5qTn93SIZX9p-FWlaQ5zd0s/view?usp=sharing), how should I configure the last line of your snippet to get the data into a correct, readable CSV file? Could you update your answer to make it complete? I would like to learn how you configure each column so that it is formatted correctly. I also got an error for `data = list(f[grp])`: `KeyError: "Unable to open object (object 'd' doesn't exist)"` – Mario May 21 '19 at 14:59
  • Can you explain what `['axis0', 'axis1', 'block0_items', 'block0_values', 'block1_items', 'block1_values']` are? What are they called technically, and what should we know about them for data extraction? They look like folders that hold some information. Could you explain them briefly in your answer? – Mario May 21 '19 at 15:05
  • Mario, these are the names of the top level Nodes in your file. A Node can be a Group or a Dataset. (I assumed they are datasets for my coding) You can use a test on `isinstance(data, h5py.Dataset):` to confirm datasets. They are accessed by name as shown in the code `data = list(f[grp])`. `data` is a dataset object with the name `grp` from the list `a_group_keys` you created. – kcw78 May 22 '19 at 00:11
  • Thanks for your explanations. Since I uploaded `data.h5` in my first comment, I would like to convert the data into `.csv` format. Could you complete the last part of your answer based on the uploaded file so that I can understand how the conversion procedure in your answer works? I tried `data['dd48'].to_csv('data.csv')` at the end of your snippet, but I got this error: `KeyError: "Unable to open object (object 'd' doesn't exist)"` – Mario May 22 '19 at 14:01
  • Mario, I modified the code in my initial post to reflect what I found in your `data.h5` file. The datasets shown above are in a group at the root named `'dd48'`. – kcw78 May 23 '19 at 04:41
  • @Mario - did you look at the modifications and csv creation with `np.savetxt()`? Was it helpful? – kcw78 May 24 '19 at 15:07
  • Oh, yes, but I found this very short code: `import pandas as pd`, `from pandas import HDFStore`, `store = pd.HDFStore('data.h5')`, `store['dd48'].to_csv('data.csv')`. It worked perfectly, and I was wondering what advantage your answer has over it. Please check it out and let me know whether your code can be updated to produce the same result. – Mario May 24 '19 at 17:10
  • 1
    @Mario Use the simple solution. :) My code is appropriate if you want to work specifically with numpy arrays. Probably overkill for your scenario. – kcw78 May 24 '19 at 20:53
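
For reference, the `HDFStore` approach from the last comments can be extended to files where the key is not known in advance. This is only a sketch: `export_all_keys` and the `out_prefix` parameter are hypothetical names, it assumes every key in the store holds a pandas object, and it needs PyTables installed.

```python
import pandas as pd


def export_all_keys(h5_path, out_prefix="output_"):
    """Write every pandas object in the store to its own CSV file.

    Hypothetical helper for illustration; returns the paths written.
    """
    written = []
    # The context manager closes the file even if an export fails
    with pd.HDFStore(h5_path, mode="r") as store:
        for key in store.keys():  # keys look like '/dd48'
            name = key.strip("/").replace("/", "_")
            csv_path = f"{out_prefix}{name}.csv"
            store[key].to_csv(csv_path)
            written.append(csv_path)
    return written
```

For the file in the question, `export_all_keys('data.h5')` would be expected to produce `output_dd48.csv`.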