
I have an EC2 instance running PySpark, and I'm able to connect to it over SSH and run interactive code in a Jupyter Notebook.

I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')

it throws a long Python error message that ends with something like:

Py4JJavaError: An error occurred while calling o131.csv.

EGM8686

2 Answers


Specify the S3 path along with the access key and secret key, as follows:

's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
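
For example, a minimal sketch of the full read call, assuming the s3n connector is available on the cluster (bucket and path names here are placeholders):

df = spark.read.csv('s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv')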
Anurag Sharma
  • Changing it does not fix the issue; I get the same error. If I move the file to a directory on the EC2 instance, I'm able to read it without passing user and password information. – EGM8686 Mar 18 '19 at 02:52

Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see "Technically what is the difference between s3n, s3a and s3?"). Putting this together, you get

spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")

As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
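As an alternative to embedding the keys in the URL, here is a minimal sketch of passing them through the Hadoop configuration instead. This assumes the hadoop-aws connector and a matching AWS SDK jar are on the cluster's classpath; the bucket and file names are placeholders:

from pyspark.sql import SparkSession

# Pass S3A credentials via the Hadoop configuration rather than the URL.
spark = (SparkSession.builder
         .appName('Basics')
         .config('spark.hadoop.fs.s3a.access.key', '<AWS_ACCESS_KEY_ID>')
         .config('spark.hadoop.fs.s3a.secret.key', '<AWS_SECRET_ACCESS_KEY>')
         .getOrCreate())

df = spark.read.csv('s3a://bucketname/filename.csv')

On EC2 you can also attach an IAM role (instance profile) to the instance; depending on the hadoop-aws version, S3A can pick up credentials from the role automatically, so no keys need to appear in code.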

Sim
  • Changing it does not fix the issue; I get the same error. If I move the file to a directory on the EC2 instance, I'm able to read it without passing user and password information. – EGM8686 Mar 18 '19 at 03:16
  • Sure, an EC2 volume typically has a normal file system so you can read directly from it. It's rather difficult to diagnose your issue without a ton more information, e.g., roles, permissions, complete code, full stack traces, etc. The answer above is how Spark accesses S3 without mounting. If it's not working for you, it's a configuration/permissions issue. – Sim Mar 19 '19 at 03:03