
Is there a way to read HDF5 files using the Scala version of Spark?

It looks like it can be done in Python (via PySpark), but I can't find anything for Scala.

M--
John

3 Answers


There isn't a Hadoop InputFormat implementation for HDF5, because the format cannot be split arbitrarily:

Breaking the container into blocks is a bit like taking an axe and chopping it to pieces, severing blindly the content and the smart wiring in the process. The result is a mess, because there's no alignment or correlation between HDFS block boundaries and the internal HDF5 cargo layout or container support structure. Reference

The same site discusses the possibility of transforming HDF5 files into Avro files so that they can be read by Hadoop/Spark. The PySpark example you alluded to is probably a simpler way to go, but as the linked document mentions, there are a number of technical challenges that need to be addressed before HDF5 files can be worked with efficiently and effectively in Hadoop/Spark.
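If you do go the conversion route, reading the resulting Avro files back into Spark from Scala is straightforward. A minimal sketch, assuming a hypothetical output directory /tmp/converted and that Avro support is available (built into Spark 2.4+, or via the spark-avro package on earlier versions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Read converted Avro").getOrCreate()

// The path is hypothetical; point it at wherever the
// HDF5-to-Avro conversion wrote its output.
val df = spark.read.format("avro").load("/tmp/converted/")
df.printSchema()
df.show()

The schema you see will depend entirely on how the conversion tool mapped the HDF5 datasets into Avro records.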

halfer
Timothy Perrigo

There's a new product that can talk to HDF5 from Apache Spark via Scala:

https://www.hdfgroup.org/downloads/hdf5-enterprise-support/hdf5-connector-for-apache-spark/

With the above product, you can open and read HDF5 files in Scala like this:

//
// HOW TO RUN:
//
// $ spark-2.3.0-SNAPSHOT-bin-hdf5s-0.0.1/bin/spark-shell -i demo.scala

import org.hdfgroup.spark.hdf5._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL HDF5 example").getOrCreate()

// We assume that HDF5 files (e.g., GSSTF_NCEP.3.2008.12.31.he5) are 
// under /tmp directory. Change the path name ('/tmp') if necessary.
val df = spark.read.option("extension", "he5").option("recursion", "false").hdf5("/tmp/", "/HDFEOS/GRIDS/NCEP/Data Fields/SST")

// Let's print some values from the dataset.
df.show()

// The output will look like below.
//
//+------+-----+------+
//|FileID|Index| Value|
//+------+-----+------+
//|     0|    0|-999.0|
//|     0|    1|-999.0|
//|     0|    2|-999.0|
//...

System.exit(0)
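Once the connector has produced a DataFrame, ordinary Spark SQL applies. As a sketch, assuming the FileID/Index/Value schema shown in the output above, you could filter out the -999.0 fill values like this:

// Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("sst")
val valid = spark.sql("SELECT Index, Value FROM sst WHERE Value > -999.0")
valid.show()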
HDFEOS.org

The answer to this question has an example of how to read multiple HDF5 files (compressed as .tar.gz) from the Million Song Dataset and extract each file's features, ending up with a Spark RDD in which each element is the array of features of one HDF5 file.
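The general pattern behind that approach can be sketched in Scala as well. Here binaryFiles reads each file in as raw bytes, and extractFeatures is a hypothetical function you would implement with an HDF5 library (for example the HDF Group's Java bindings) to parse one file's bytes into its feature array:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HDF5 features").getOrCreate()

// Read every file under the directory as (path, bytes) pairs.
// The path is hypothetical; point it at your extracted dataset.
val files = spark.sparkContext.binaryFiles("/path/to/hdf5/dir")

// Hypothetical parser: turn one file's bytes into its feature vector.
def extractFeatures(bytes: Array[Byte]): Array[Double] = ???

// RDD where each element is the feature array of one HDF5 file.
val features = files.map { case (_, stream) => extractFeatures(stream.toArray()) }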

Marc Cayuela