
I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I'm having a lot of trouble loading it into an RDD. I've been able to find examples online that help with discrete aspects of the problem:

- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)

- using boto/boto3 to get text-format data from S3 into Spark via get_contents_as_string()

- mapping a set of text files to an RDD via keys.flatMap(mapFunc)

But I can't seem to get these parts to work together. Specifically: how do you load a netCDF file from S3 (using boto or otherwise; I'm not attached to boto) so that it can then be read with h5py? Or can you treat the netCDF file as a binary file, load it with sc.binaryFiles(path), and map it to an RDD?
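
For reference, the PySpark call I have in mind is sc.binaryFiles, which returns (path, content-as-bytes) pairs per file. A rough sketch of what I mean (the bucket path is made up, and this assumes the cluster's Hadoop S3 support is configured):

# read whole files as (path, bytes) pairs -- assumes s3a:// access is configured
rdd = sc.binaryFiles('s3a://my-bucket/data/*.nc')
# the open question: how to turn each value (raw netCDF bytes) into an h5py/netCDF object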

Here are a couple of related questions that weren't fully answered:

How to read binary file on S3 using boto?

using pyspark, read/write 2D images on hadoop file system

  • _can you treat the netcdf file as a binary file and load it in as a binary file_ - as far as I know the answer is negative. `h5py` uses the C client directly and doesn't support in-memory buffers (`BytesIO`). S3Fs provides a hassle-free `get` which can be used to copy from S3 to a local file system; the copy can then be accessed with `h5py`. – zero323 Apr 04 '17 at 04:34
  • thanks @zero323, looking up the s3fs interface – abe732 Apr 06 '17 at 00:40
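
A minimal sketch of the workaround zero323 describes (bucket, key, and local path below are made up): copy the object to local disk with s3fs, then open the local copy with h5py, which works because netCDF-4 files are HDF5 files underneath:

import h5py
import s3fs

# copy the object from S3 to the local file system
s3 = s3fs.S3FileSystem()
s3.get('my-bucket/data/a_file.nc', '/tmp/a_file.nc')

# netCDF-4 files are HDF5-based, so h5py can open the local copy
with h5py.File('/tmp/a_file.nc', 'r') as f:
    print(list(f.keys()))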

1 Answer


Using the netCDF4 and s3fs modules, you can do:

from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

# read the raw netCDF bytes straight from S3
filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()

# parse the bytes in memory; the filename here is just a label
root = Dataset('inmemory.nc', memory=nc_bytes)

Make sure you are set up to read from S3; for details, see the s3fs documentation.
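
To get this into a Spark RDD (the original question), one possible pattern is to parallelize the list of S3 keys and do the s3fs read inside the map function. This is only a sketch; the bucket paths and the 'temperature' variable name below are made up:

from netCDF4 import Dataset
import s3fs

def load_nc(key):
    # open the filesystem inside the task so each executor makes its own connection
    s3 = s3fs.S3FileSystem()
    with s3.open(key, 'rb') as f:
        nc_bytes = f.read()
    root = Dataset('inmemory.nc', memory=nc_bytes)
    # extract plain arrays here; Dataset objects themselves don't serialize well
    values = root.variables['temperature'][:]
    root.close()
    return key, values

keys = ['s3://bucket/a_file.nc', 's3://bucket/b_file.nc']
rdd = sc.parallelize(keys).map(load_nc)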
