
I have an EC2 instance running PySpark, and I'm able to connect to it over SSH and run interactive code in a Jupyter Notebook.

I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')

it throws a long Python error message that ends with something like:

Py4JJavaError: An error occurred while calling o131.csv.

EGM8686

2 Answers


Specify the S3 path along with the access key and secret key, as follows:

's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
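
For example, a minimal sketch of the full read call, assuming the s3n connector is available on the cluster (bucket and path names here are placeholders):

df = spark.read.csv('s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv')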
Anurag Sharma
  • Changing it does not fix the issue; I get the same error. If I move the file to a directory on the EC2 instance, I'm able to read it without passing user and password information. – EGM8686 Mar 18 '19 at 02:52

Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see "Technically what is the difference between s3n, s3a and s3?"). Putting this together, you get

spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")

As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
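As an alternative to embedding the keys in the URL, here is a minimal sketch of passing them through the Hadoop configuration instead. This assumes the hadoop-aws connector and a matching AWS SDK jar are on the cluster's classpath; the bucket and file names are placeholders:

from pyspark.sql import SparkSession

# Pass S3A credentials via the Hadoop configuration rather than the URL.
spark = (SparkSession.builder
         .appName('Basics')
         .config('spark.hadoop.fs.s3a.access.key', '<AWS_ACCESS_KEY_ID>')
         .config('spark.hadoop.fs.s3a.secret.key', '<AWS_SECRET_ACCESS_KEY>')
         .getOrCreate())

df = spark.read.csv('s3a://bucketname/filename.csv')

On EC2 you can also attach an IAM role (instance profile) to the instance; depending on the hadoop-aws version, S3A can pick up credentials from the role automatically, so no keys need to appear in code.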

Sim
  • Changing it does not fix the issue; I get the same error. If I move the file to a directory on the EC2 instance, I'm able to read it without passing user and password information. – EGM8686 Mar 18 '19 at 03:16
  • Sure, an EC2 volume typically has a normal file system so you can read directly from it. It's rather difficult to diagnose your issue without a ton more information, e.g., roles, permissions, complete code, full stack traces, etc. The answer above is how Spark accesses S3 without mounting. If it's not working for you, it's a configuration/permissions issue. – Sim Mar 19 '19 at 03:03