I am trying to read a zipped file (.zip) from S3. I've tried the method below:
config_dict = {"fs.s3n.awsAccessKeyId": AWS_KEY,
               "fs.s3n.awsSecretAccessKey": AWS_SECRET}
print filename
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf=config_dict)
which results in an exception:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.hadoopFile.
: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache...
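From the stack trace I suspect the s3n filesystem class either isn't on the classpath or isn't registered. A sketch of what I was thinking of trying next, explicitly naming the implementation class in the config (the `fs.s3n.impl` key and class name are my assumption, and `NativeS3FileSystem` requires the hadoop-aws/jets3t jars to be on the classpath):

```python
AWS_KEY = "MY_ACCESS_KEY"      # placeholder credentials
AWS_SECRET = "MY_SECRET_KEY"   # placeholder credentials

# Explicitly register the s3n filesystem implementation alongside the keys.
config_dict = {
    "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "fs.s3n.awsAccessKeyId": AWS_KEY,
    "fs.s3n.awsSecretAccessKey": AWS_SECRET,
}
```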
I've also tried connecting using Boto
aws_connection = S3Connection(AWS_KEY, AWS_SECRET)
bucket = aws_connection.get_bucket('myBucket')
and tried to decompress the contents with GzipFile:
ip = gzip.GzipFile(fileobj=StringIO(key.get_contents_as_string()))
myrdd = sc.textFile(ip.read())
This does not give me the desired result.
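Since a .zip archive is not a gzip stream, I suspect GzipFile cannot open it at all. For reference, a minimal sketch of what I was expecting to work, using Python's zipfile module on the downloaded bytes (an in-memory zip stands in here for the S3 object, and `sc.parallelize` would consume the resulting list):

```python
import io
import zipfile

# Stand-in for the bytes boto returns; with boto it would be:
#   raw = key.get_contents_as_string()
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.txt", "line1\nline2\nline3")
raw = buf.getvalue()

# A .zip is an archive, not a gzip stream, so open it with ZipFile.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    name = zf.namelist()[0]  # first entry in the archive
    lines = zf.read(name).decode("utf-8").splitlines()

# lines is now a plain Python list of strings; in Spark I would then do:
# myrdd = sc.parallelize(lines)
```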
If I feed the same zip file to my Spark program from my local machine, as below, the contents are read properly:
myrdd = sc.textFile(<my zipped file>)
Can someone give me an idea of how to read a zipped file from S3 into a Spark RDD?
Thanks in advance.