
I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I'm having a lot of trouble loading it into an RDD. I've been able to find examples online that help with discrete aspects of the problem:

- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)

- using boto/boto3 to get text-format data from S3 into Spark via get_contents_as_string()

- mapping a set of text files to an RDD via keys.flatMap(mapFunc)

But I can't seem to get these parts to work together. Specifically: how do you load a netCDF file from S3 (using boto or otherwise; I'm not attached to boto) so that it can then be read with h5py? Or can you treat the netCDF file as a binary file, load it with sc.binaryFiles(path), and map it to an RDD?
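
For reference, the PySpark call I have in mind is sc.binaryFiles, which returns (path, content-as-bytes) pairs per file. A rough sketch of what I mean (the bucket path is made up, and this assumes the cluster's Hadoop S3 support is configured):

# read whole files as (path, bytes) pairs -- assumes s3a:// access is configured
rdd = sc.binaryFiles('s3a://my-bucket/data/*.nc')
# the open question: how to turn each value (raw netCDF bytes) into an h5py/netCDF object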

Here are a couple of related questions that weren't fully answered:

How to read binary file on S3 using boto?

using pyspark, read/write 2D images on hadoop file system

  • _can you treat the netcdf file as a binary file and load it in as a binary file_ - as far as I know the answer is negative. `h5py` uses the C client directly and doesn't support in-memory buffers (`BytesIO`). S3Fs provides a hassle-free `get` which can be used to copy from S3 to a local file system; the copy can then be accessed with `h5py`. – zero323 Apr 04 '17 at 04:34
  • thanks @zero323, looking up the s3fs interface – abe732 Apr 06 '17 at 00:40
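
A minimal sketch of the workaround zero323 describes (bucket, key, and local path below are made up): copy the object to local disk with s3fs, then open the local copy with h5py, which works because netCDF-4 files are HDF5 files underneath:

import h5py
import s3fs

# copy the object from S3 to the local file system
s3 = s3fs.S3FileSystem()
s3.get('my-bucket/data/a_file.nc', '/tmp/a_file.nc')

# netCDF-4 files are HDF5-based, so h5py can open the local copy
with h5py.File('/tmp/a_file.nc', 'r') as f:
    print(list(f.keys()))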

1 Answer


Using the netCDF4 and s3fs modules, you can do:

from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

# read the raw netCDF bytes straight from S3
filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()

# parse the bytes in memory; the filename here is just a label
root = Dataset('inmemory.nc', memory=nc_bytes)

Make sure you are set up to read from S3; for details, see the s3fs documentation.
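
To get this into a Spark RDD (the original question), one possible pattern is to parallelize the list of S3 keys and do the s3fs read inside the map function. This is only a sketch; the bucket paths and the 'temperature' variable name below are made up:

from netCDF4 import Dataset
import s3fs

def load_nc(key):
    # open the filesystem inside the task so each executor makes its own connection
    s3 = s3fs.S3FileSystem()
    with s3.open(key, 'rb') as f:
        nc_bytes = f.read()
    root = Dataset('inmemory.nc', memory=nc_bytes)
    # extract plain arrays here; Dataset objects themselves don't serialize well
    values = root.variables['temperature'][:]
    root.close()
    return key, values

keys = ['s3://bucket/a_file.nc', 's3://bucket/b_file.nc']
rdd = sc.parallelize(keys).map(load_nc)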
