I am trying to download a dataset of files in the HDF5 format. All the files are located on an HDFS that I set up. I want to use Spark to download the files and then somehow convert them, but I haven't figured out how to convert the HDF5 files into something usable/readable. Is it possible to convert them into a dataframe and then work on it with pandas?

Any help is appreciated. Thanks in advance.

I have tried to read some documentation about wrapper classes etc., but I am pretty new to programming and a bit lost. I have worked with CSV files before, and downloading them from the HDFS using Spark and then running pandas commands on the dataframe worked flawlessly, but I am struggling with the HDF5 format.

Pollux05
  • [h5py](https://docs.h5py.org/en/stable/quick.html#quick) exposes the data as numpy arrays. Going from there to dataframes is trivial. Also, what's wrong with [pandas' own support for HDF5](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html)? – Homer512 Mar 12 '23 at 13:28
  • No need to convert HDF5 to pandas dataframes -- just read the HDF5 files as-is. The hardest part is figuring out the schema. Once you know that, it's a snap to access the data. Also, you can use HDFView from The HDF Group to view the data in a GUI. – kcw78 Mar 12 '23 at 15:19
  • @Homer512 The files are located on an HDFS, and pandas' own methods only support local file systems. – Pollux05 Mar 13 '23 at 00:32
  • @kcw78 Thank you for the idea, HDFView is helpful. I do know what the schema of the dataset looks like, but I am still confused about how to "read the HDF5 files as-is". They are located on an HDFS and I am trying to get them using Spark. What would be a command to read them as-is? – Pollux05 Mar 13 '23 at 00:48

1 Answer

This is somewhat speculative since I don't have an HDFS file system to test it.

However, from what I can gather, you can open HDFS files as file-like objects using Pydoop.

Using h5py, you can read HDF5 files from file-like objects via the `driver="fileobj"` argument. That means this should work, in theory:

```python
from pydoop import hdfs
import h5py

# Open the HDFS file in binary mode and hand the file-like object to h5py
with hdfs.open('/user/myuser/filename', 'rb') as f:
    with h5py.File(f, 'r', driver='fileobj') as h:
        dataset = h['/group/dataset']  # address the dataset by its HDF5 path
        content = dataset[:]           # read the whole dataset into a numpy array
```
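
From there, getting the data into pandas is straightforward, since h5py hands you plain numpy arrays. A minimal sketch, assuming `content` came out as a 2-D numeric array or a compound (record-like) dtype; the column names below are hypothetical, so substitute whatever matches your schema:

```python
import pandas as pd

# For compound dtypes, pandas picks up the field names automatically;
# for a plain 2-D array, supply your own (hypothetical) column names.
df = pd.DataFrame(content)
# df = pd.DataFrame(content, columns=['col_a', 'col_b', 'col_c'])
print(df.head())
```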

There also seems to be an HDFS driver for the HDF5 library itself, but compiling that and getting it to work with h5py or pandas' HDFStore might be challenging.
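
As an aside, if you ever need to discover the schema of an unfamiliar file (per kcw78's comment), h5py can walk the hierarchy for you. A small sketch reusing the same file-object approach as above:

```python
from pydoop import hdfs
import h5py

# Print the path, shape, and dtype of every dataset in the file --
# a quick way to discover the schema without a GUI like HDFView.
def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with hdfs.open('/user/myuser/filename', 'rb') as f:
    with h5py.File(f, 'r', driver='fileobj') as h:
        h.visititems(show)
```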

Homer512