
I am getting the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: s3n ...

This happens whenever I try to retrieve data from S3. My spark-defaults.conf has the following line:

spark.jars      /Users/lrezende/Desktop/hadoop-aws-2.9.0.jar

And this file is on my Desktop.

My code is:

from pyspark.sql import SparkSession
# Stop any existing session first (guarded, so the first run doesn't raise NameError)
if 'spark' in globals():
    spark.stop()

spark = SparkSession\
        .builder\
        .master("<master-address>")\
        .appName("Test")\
        .getOrCreate()

spark.sparkContext.setLogLevel('ERROR')
lines = spark.sparkContext.textFile("s3n://bucket/something/2017/*")
lines.collect()

The error appears when I run lines.collect().

Could someone help me to fix it?

Lucas Rezende
    Related to answer: https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3 – OneCricketeer Feb 23 '18 at 23:36

2 Answers


If you are using a new(ish) version of Spark -- and, transitively, Hadoop -- you need to use the s3a URI scheme instead of s3n.
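A minimal sketch of the change, reusing the path from the question (the commented-out read assumes a running SparkSession named `spark` with hadoop-aws on the classpath, as in the question):

```python
# Swap the URI scheme from s3n to s3a; everything else stays the same.
path = "s3n://bucket/something/2017/*"
s3a_path = path.replace("s3n://", "s3a://", 1)  # only the scheme changes
print(s3a_path)  # s3a://bucket/something/2017/*

# With a running SparkSession named `spark`:
# lines = spark.sparkContext.textFile(s3a_path)
# lines.collect()
```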

ktdrv

After all, my "problem" was quite simple to solve. I had already added the following line to my spark-defaults.conf:

spark.jars.packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.9.0
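For reference, the same dependency list can also be set programmatically when the session is built — a configuration sketch only, not tested here, and the versions must match your Hadoop build:

```python
from pyspark.sql import SparkSession

# Sketch: equivalent to the spark.jars.packages line in spark-defaults.conf.
spark = SparkSession.builder \
    .appName("Test") \
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk:1.10.34,"
            "org.apache.hadoop:hadoop-aws:2.9.0") \
    .getOrCreate()
```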

Every time, I re-imported all the libraries in my Jupyter notebook, but what I hadn't tried was restarting the Jupyter service. This is still a bit confusing, because after the fix, every time I create a session spark-defaults.conf is read and the needed packages are downloaded. Why didn't that happen before?

Anyway, thanks everyone for your time.

Lucas Rezende