
I followed this blog post which suggests using:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Pull in the S3A connector and authenticate with temporary (session) credentials
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', <access_key>)
conf.set('spark.hadoop.fs.s3a.secret.key', <secret_key>)
conf.set('spark.hadoop.fs.s3a.session.token', <token>)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

I used it to configure PySpark, and it worked: I could read data from S3 directly from my local machine.
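For reference, once the session is created this way, reading data is just a matter of pointing the reader at an s3a:// path (the bucket and key below are placeholders, not real paths):

# Hypothetical bucket/key, shown only to illustrate reading through the s3a:// scheme
df = spark.read.csv('s3a://my-bucket/path/to/data.csv', header=True)
df.show(5)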


However, I found this question about the use of s3a, s3n, or s3, and one of the more recent answers advises against using s3a. I also found this guide from AWS discouraging the use of s3a:

Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.


So I tried to find out how to use the s3 scheme with PySpark and Hadoop, but I found this guide from Hadoop saying that it only officially supports s3a:

There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.


The method from the blog post works, but is it the best option for this situation? Is there any other way to configure this?

What would be the best method to access S3 from a local machine?

Gustavo
  • Running locally is not using EMR, so use `s3a`, as the Hadoop-AWS docs say. Alternatively, just use `boto3` – OneCricketeer Jan 31 '22 at 19:03
  • Is it possible to use boto3 with PySpark? – Gustavo Jan 31 '22 at 20:50
  • I don't see why not, but it's not really clear what you need to do with the data. Simply reading data from S3 does not require Spark – OneCricketeer Jan 31 '22 at 22:18
  • Indeed, simply reading data from S3 does not require Spark; however, what I want is to read data from S3 into Spark, directly into a processing pipeline. – Gustavo Feb 01 '22 at 12:06
  • Looking at the negative post you pointed at, his criticisms would have been valid for the Hadoop 2.7 release, but as that was 2016, it's five years out of date now. Everyone uses the s3a connector, at scales you would have to go to significant effort to exceed – stevel Feb 01 '22 at 13:20
  • BTW, if you set your AWS session env vars up, Spark will pick them up automatically. Never put secrets in code if you can avoid it. – stevel Feb 01 '22 at 13:21

1 Answer


The AWS docs are about EMR. Your local system is not EMR, so ignore them completely.

Use the ASF-developed S3A connector and look at the Hadoop docs on how to use it, in preference to examples from out-of-date Stack Overflow posts. (i.e. if the docs say something that contradicts a four-year-old post, go with the docs. Or even the source.)
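Following that advice, a minimal sketch of a local setup that keeps the credentials out of the code might look like the following. The hadoop-aws version and the bucket/prefix in the read path are placeholders, and it assumes the standard AWS environment variables are exported in the shell before Spark is started:

from pyspark.sql import SparkSession

# Assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and (for temporary credentials)
# AWS_SESSION_TOKEN are already exported in the environment; the S3A default
# credential provider chain picks them up, so no secrets appear in the code.
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
    .getOrCreate()
)

# 's3a://my-bucket/some/prefix/' is a placeholder path used only for illustration
df = spark.read.parquet('s3a://my-bucket/some/prefix/')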

stevel