2

I just started learning to use spark and AWS. I have configured my spark session as follows:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                     .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
                     .config("spark.master", "local") \
                     .config("spark.app.name", "S3app") \
                     .getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

I would like to read the data from S3a interface using pyspark as follows:

df = spark.read.csv("s3a://some_container/some_csv.csv")

But I keep getting java.lang.ClassNotFoundException:

22/07/11 00:48:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/07/11 00:48:01 WARN FileSystem: Failed to initialize fileystem s3a://udacity-dend/sparkbyexamples/csv/zipcodes.csv: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found

After googling, I think this is because I have not configured Spark properly. How should I fix this?

Guolun Li
  • https://docs.dominodatalab.com/en/4.6/user_guide/a3b42e/work-with-data/#_s3_usage_examples – Dan M Jul 22 '22 at 05:35

2 Answers

1

These are the Java packages needed (original guide):

  • hadoop-aws: must match the Hadoop version Spark was built with (e.g. 3.3.1; a quick way to check this version is shown after the list)
  • aws-java-sdk-bundle: a dependency of the hadoop-aws package above
  • hadoop-common: must match the Hadoop version Spark was built with (e.g. 3.3.1)
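
To see which Hadoop version your Spark distribution was built with (and therefore which version of the jars to pick), one quick check from PySpark is the following sketch (it assumes spark is an already-created SparkSession):

# Print the Hadoop version bundled with this Spark installation,
# so that hadoop-aws and hadoop-common can be matched to it.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())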

If using PySpark shell, the packages can be included in the following manner:

pyspark --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 (...)

Then within the shell, declare:

# sc => spark context
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "endpoint_include_region")

We can then start using S3, e.g. to save a DataFrame:

df.write.format("csv").option("header", "true").save("s3a://datasets/result")

The folder datasets already exists. After the command above, a result folder is created inside it, and its contents look like this:

[contents of the result folder]

The result on a different setup (e.g. more data, more Executors) may vary.
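
Reading back over s3a works the same way, e.g. loading the CSV that was just written (a sketch using the same path):

# Read the CSV back from the same s3a path; the header option matches the write
df2 = spark.read.option("header", "true").csv("s3a://datasets/result")
df2.show(5)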

My environment

  • Apache Spark 3.2.2 (built for Hadoop 3.3.1)
  • Python 3.10.4
  • openjdk 11.0.16.1 2022-08-12
  • Ubuntu 22.04 LTS
JoyfulPanda
0

Go to Maven to check that you have the correct jars for the version of the hadoop-aws jar you have. Open https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws, pick the version of the hadoop-aws jar you are using, and make sure you have the correct versions of its compile, provided, and runtime dependencies.

Also, when your credentials come from environment variables, you will need to add this to your code:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
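
Equivalently, the same property can be set when the session is built, using Spark's spark.hadoop.* prefix. This is only a sketch based on the question's session; the hadoop-aws version shown is an assumption and still has to match the Hadoop version your Spark was built with:

from pyspark.sql import SparkSession

# Any spark.hadoop.* option is forwarded into the Hadoop configuration, so the
# S3A connector reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.EnvironmentVariableCredentialsProvider") \
    .config("spark.master", "local") \
    .config("spark.app.name", "S3app") \
    .getOrCreate()

df = spark.read.csv("s3a://some_container/some_csv.csv")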