I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I'm having trouble loading it into an RDD. I've found examples online that address discrete aspects of the problem:
- using h5py to read locally stored scientific data files via h5py.File(filename)
(https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)
- using boto/boto3 to pull text-format data from S3 into Spark via get_contents_as_string()
- mapping a set of text files to an RDD via keys.flatMap(mapFunc)
But I can't seem to get these pieces to work together. Specifically: how do you load a NetCDF file from S3 (with boto or otherwise; I'm not attached to boto) so that you can then open it with h5py? Or can you treat the NetCDF file as a binary file, load it with sc.binaryFiles(path), and map the result to an RDD?
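For the binary-file route, here's a sketch of what I'm imagining (the s3a:// path is a placeholder; it assumes each .nc file is NetCDF-4 so h5py can parse it, and that each file fits in executor memory, since binaryFiles materializes every file as one byte string):

```python
import io

import h5py


def datasets_from_bytes(record):
    """Turn one (path, bytes) pair from sc.binaryFiles into (path, dataset_name, array) rows."""
    path, raw = record
    # h5py >= 2.9 can open an in-memory file-like object directly
    with h5py.File(io.BytesIO(raw), "r") as f:
        return [(path, name, f[name][...]) for name in f]


if __name__ == "__main__":
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # binaryFiles yields (filename, contents-as-bytes) pairs
    rdd = (sc.binaryFiles("s3a://my-bucket/data/*.nc")  # placeholder path
             .flatMap(datasets_from_bytes))
    print(rdd.take(1))
```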
Here are a couple of similar related questions that weren't fully answered: