
I am trying to read a zipped file (.zip) from S3. I've tried the method below:

config_dict = {"fs.s3n.awsAccessKeyId": AWS_KEY,
               "fs.s3n.awsSecretAccessKey": AWS_SECRET}
print(filename)
# TextInputFormat yields (LongWritable, Text) records, so the key
# class must come before the value class in this call
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf=config_dict)

which results in the following exception:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.hadoopFile.
: java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache...
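
From what I understand, this error means no FileSystem implementation is registered for the s3n scheme. I suspect the conf needs an entry along these lines (my assumption, not verified; it also requires the class to actually be on Spark's classpath, e.g. via the hadoop-aws jar):

config_dict = {"fs.s3n.awsAccessKeyId": AWS_KEY,
               "fs.s3n.awsSecretAccessKey": AWS_SECRET,
               # assumption: map the s3n:// scheme to the native S3
               # filesystem class shipped with hadoop-aws
               "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"}

Even then, I'd still be feeding a .zip through TextInputFormat, which I don't think decompresses zip archives.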

I've also tried connecting using Boto:

from boto.s3.connection import S3Connection

aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('myBucket')
key = bucket.get_key(filename)

and tried to unzip it using gzip:

import gzip
from StringIO import StringIO

ip = gzip.GzipFile(fileobj=StringIO(key.get_contents_as_string()))
myrdd = sc.textFile(ip.read())

This is not giving me the desired result.
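
To clarify what I'm aiming for: as far as I know a .zip archive isn't in gzip format, so Python's zipfile module would be needed rather than gzip, and sc.textFile expects a path rather than file contents, so the lines would have to go in via sc.parallelize instead. A minimal sketch of that idea, assuming the archive contains a single text file:

import zipfile
from io import BytesIO

# open the archive from the bytes fetched via boto
archive = zipfile.ZipFile(BytesIO(key.get_contents_as_string()))
# read the first (assumed only) entry and split it into lines
contents = archive.read(archive.namelist()[0]).decode('utf-8')
myrdd = sc.parallelize(contents.splitlines())

But this pulls the whole archive through the driver, so I'm not sure it's the right approach.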

If I feed the same zip file to my Spark program from my local machine, as below, the contents are read properly:

myrdd = sc.textFile(<my zipped file>)

Can someone give me an idea of how to read a zipped file from S3 into a Spark RDD?

Thanks in advance

  • This question looks similar. http://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark – Marie Jun 15 '16 at 12:29
  • IMHO, you're probably better off unzipping the file outside of spark -- the "unzip" process can only happen on one node, which negates any advantages of Spark. – Marco Jan 17 '19 at 10:32
