I am trying to download a dataset of files in the HDF5 format. All the files are located on an HDFS that I set up. I want to use Spark to download the files and then somehow convert them, but I haven't figured out how to convert the HDF5 files into something usable/readable. Is it possible to convert them into a dataframe and then work on it with pandas?

Any help is appreciated. Thanks in advance.

I have tried to read some documentation about wrapper classes etc., but I am pretty new to programming and a bit lost. I have worked with CSV files before, and downloading them from the HDFS using Spark and then running pandas commands on the dataframe worked flawlessly, but I am struggling with the HDF5 format.

Pollux05
  • [h5py](https://docs.h5py.org/en/stable/quick.html#quick) exposes the data as numpy arrays. Going from there to dataframes is trivial. Also, what's wrong with [pandas' own support for HDF5](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html)? – Homer512 Mar 12 '23 at 13:28
  • No need to convert HDF5 to pandas dataframes -- just read the HDF5 files as-is. The hardest part is figuring out the schema. Once you know that, it's a snap to access the data. Also, you can use HDFView from The HDF Group to view the data in a GUI. – kcw78 Mar 12 '23 at 15:19
  • @Homer512 The files are located on an HDFS, and pandas' own methods only support local file systems. – Pollux05 Mar 13 '23 at 00:32
  • @kcw78 Thank you for the idea, HDFView is helpful. I do know what the schema of the dataset looks like, but I am still confused about how to "read the HDF5 files as-is". They are located on an HDFS and I am trying to get them using Spark. What would be a command to read them as-is? – Pollux05 Mar 13 '23 at 00:48

1 Answer

This is somewhat speculative since I don't have an HDFS file system to test it.

However, from what I can gather, you can open HDFS files as file-like objects using Pydoop.

Using h5py, you can read HDF5 files from file-like objects via the `driver="fileobj"` argument. That means this should work, in theory:

```python
from pydoop import hdfs
import h5py

# Open the HDFS file in binary mode and hand the file-like object to h5py
with hdfs.open('/user/myuser/filename', 'rb') as f:
    with h5py.File(f, 'r', driver='fileobj') as h:
        dataset = h['/group/dataset']  # address the dataset by its HDF5 path
        content = dataset[:]           # read the whole dataset into a numpy array
```
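
From there, getting the data into pandas is straightforward, since h5py hands you plain numpy arrays. A minimal sketch, assuming `content` came out as a 2-D numeric array or a compound (record-like) dtype; the column names below are hypothetical, so substitute whatever matches your schema:

```python
import pandas as pd

# For compound dtypes, pandas picks up the field names automatically;
# for a plain 2-D array, supply your own (hypothetical) column names.
df = pd.DataFrame(content)
# df = pd.DataFrame(content, columns=['col_a', 'col_b', 'col_c'])
print(df.head())
```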

There also seems to be an HDFS driver for the HDF5 library itself, but compiling that and getting it to work with h5py or pandas' HDFStore might be challenging.
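
As an aside, if you ever need to discover the schema of an unfamiliar file (per kcw78's comment), h5py can walk the hierarchy for you. A small sketch reusing the same file-object approach as above:

```python
from pydoop import hdfs
import h5py

# Print the path, shape, and dtype of every dataset in the file --
# a quick way to discover the schema without a GUI like HDFView.
def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with hdfs.open('/user/myuser/filename', 'rb') as f:
    with h5py.File(f, 'r', driver='fileobj') as h:
        h.visititems(show)
```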

Homer512