
I have been looking for a clear answer to this question all morning but couldn't find anything understandable. I started using PySpark (installed with pip) a little while ago and have a simple .py file that reads data from local storage, does some processing and writes the results locally. I currently run it with: python my_file.py

What I'm trying to do: use files from AWS S3 as the input, and write the results to a bucket on AWS S3.

I am able to create a bucket and upload files using "boto3", but I saw some options using "spark.read.csv", which is what I want to use.
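
For context, a minimal sketch of what I currently do with boto3 (the bucket name, local path and key below are placeholders):

    import boto3

    s3 = boto3.client("s3")
    s3.create_bucket(Bucket="my-bucket")  # placeholder bucket name
    s3.upload_file("local_file.csv", "my-bucket", "key/filename.csv")  # placeholder path and key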

What I have tried: I tried to set up the credentials with:

spark = SparkSession.builder \
            .appName("my_app") \
            .config('spark.sql.codegen.wholeStage', False) \
            .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "my_key_id")
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "my_secret_key")

then:

df = spark.read.option("delimiter", ",").csv("s3a://bucket/key/filename.csv", header=True)

But I get the error:

java.io.IOException: No FileSystem for scheme: s3a

Questions:

  • Do I need to install something in particular to make PySpark S3-enabled?
  • Should I somehow package my code and run a special command using the pyspark console?

Thank you all, and sorry if this is a duplicate.


SOLVED:

The solution is the following:

To link a local Spark instance to S3, you must add the aws-java-sdk and hadoop-aws JAR files to your classpath and run your app with: spark-submit --jars my_jars.jar

Be careful with the versions you use for the SDKs, as not all of them are compatible: aws-java-sdk-1.7.4 and hadoop-aws-2.7.4 worked for me.
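
For example, assuming the two JARs were downloaded locally (the paths are placeholders):

    spark-submit \
        --jars /path/to/aws-java-sdk-1.7.4.jar,/path/to/hadoop-aws-2.7.4.jar \
        my_file.py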

The configuration I used is:

spark = SparkSession.builder \
            .appName("my_app") \
            .config('spark.sql.codegen.wholeStage', False) \
            .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "mykey")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "mysecret")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "eu-west-3.amazonaws.com")
  • You also have https://stackoverflow.com/questions/45968326/accessing-s3-using-s3a-protocol-from-spark-using-hadoop-version-2-7-2 – eliasah Mar 12 '19 at 10:48
  • Thanks for your answer, I have looked at the issues you pointed out, but none correspond to my question. I think I don't run my applications the right way, which might be the real problem. – César Bouyssi Mar 12 '19 at 11:14
  • then you should take look at this https://spark.apache.org/docs/latest/submitting-applications.html – eliasah Mar 12 '19 at 13:02
  • **TL; DR:** df = spark.read.csv("s3://bucket/key/filename.csv"). If you want to read in with column names that you provide, on the next line you can then do `df = df.toDF(*column_names)`, for example `df = df.toDF(*my_pandas_df.columns)` – Nic Scozzaro Jun 08 '20 at 05:56
