I am using Spark 1.6 in local mode. Following is my code:

First Attempt:

airline = sc.textFile("s3n://mortar-example-data/airline-data")
airline.take(2)

Second Attempt:

airline = sc.textFile("s3n://myid:mykey@mortar-example-data/airline-data")
airline.take(2)

Both attempts throw the following error:

Py4JJavaError: An error occurred while calling o17.partitions.
: java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

I am not sure what is missing here to connect to S3. It would be great if someone could point it out.

Dutta

1 Answer


@John

Following is my solution:

bucket = "your bucket"

# Prod App Key
prefix = "Your path to the file"
filename = "s3n://{}/{}".format(bucket, prefix)
# Set the AWS credentials on the Hadoop configuration for the s3n scheme
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YourAccessKey")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YourSecretKey")

# Note: for TextInputFormat the key class is LongWritable and the
# value class is Text (the original had the two swapped)
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text')
rdd.count()

The above code worked for me... Good luck.
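One more note: the original error, `No FileSystem for scheme: s3n`, usually means the S3 connector classes are not on Spark's classpath at all, independent of credentials. With Spark 1.6 this is typically fixed by adding the `hadoop-aws` jar (which pulls in the `jets3t` dependency that backs s3n) when launching. A sketch, where the version coordinate is an assumption that must match your Hadoop build, not something stated in this thread:

```shell
# Sketch: launch PySpark with the S3 connector on the classpath.
# org.apache.hadoop:hadoop-aws:2.7.1 is an assumed coordinate; pick
# the version matching the Hadoop your Spark build was compiled against.
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
```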
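For reference, the `format` call above simply assembles an s3n URL from the bucket and key prefix. Using the bucket and path from the question as illustrative values:

```python
# Build the s3n URL the same way the answer does; the bucket/prefix
# values here are taken from the question, purely for illustration.
bucket = "mortar-example-data"
prefix = "airline-data"
filename = "s3n://{}/{}".format(bucket, prefix)
print(filename)  # s3n://mortar-example-data/airline-data
```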

Dutta