25

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web, and tried many things, but apparently S3 has been changing over the last year or so, and all methods failed but one:

pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.

So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?

PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, but it did not work.

PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
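
Roughly, the SparkConf attempt looked like the sketch below (key values elided; the property names are just the ones from the error message):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("fs.s3n.awsAccessKeyId", "…")       # placeholder: actual key elided
conf.set("fs.s3n.awsSecretAccessKey", "…")   # placeholder: actual secret elided
sc = SparkContext(conf=conf)
sc.textFile("s3n://bucket/key").count()      # still fails with the same error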

Eric O. Lebigot
  • Also see [this answer](http://stackoverflow.com/a/33787125/1243926). – Sergey Orshanskiy Apr 12 '16 at 01:07
  • It worked for me to set the environment variables at the command line before running spark-submit with pyspark locally. Setting them inside of pyspark using `os.environ` didn't work because it's too late at that point to get picked up. – Andrew C Jun 01 '18 at 17:50
  • Yeah, things should work this way (or through the correct configuration file). This question arose only because of a bug in boto. – Eric O. Lebigot Jun 03 '18 at 06:22

5 Answers

10

Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3, the benefits of which are unclear to me.

You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:

rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',  # input format for plain text
                    'org.apache.hadoop.io.LongWritable',         # key class (byte offset)
                    'org.apache.hadoop.io.Text',                 # value class (the line)
                    conf={
                        'fs.s3n.awsAccessKeyId': '...',
                        'fs.s3n.awsSecretAccessKey': '...',
                    })
Daniel Darabos
  • Thanks, that's informative. What is this `my_file` supposed to be? Just a place where the configuration file is stored? Could it be stored beforehand, then, and locally? Another of my questions was how to access programmatically the data from `~/.aws/credentials` (short of parsing it with `ConfigParser`): do you know how to do that? – Eric O. Lebigot Apr 04 '15 at 12:26
  • `my_file` is the file you're trying to read. Instead of passing the keys in the URL, you pass them through the `conf` parameter. As far as I know `~/.aws/credentials` is an implementation detail of `aws-cli`. You could parse it yourself, or put the keys in your own config file of your preferred format. (I see it's not a complete answer. Hope it's useful anyway!) – Daniel Darabos Apr 04 '15 at 13:50
  • For reference: While I have indeed seen repeatedly that `s3n` should be used in place of the "old" `s3` "block" filesystem, the current official documentation indicates that `s3` should be used: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html. – Eric O. Lebigot Apr 04 '15 at 23:21
  • Doesn't work for me (Spark 1.5, Hadoop 2.4). I get the error "AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively)". And the S3 URL is no use there, since when you have a "/" in the key, it does not work (HADOOP-3733). – mathieu Sep 17 '15 at 13:28
  • @mathieu: Sounds like you're using `s3://` instead of `s3n://`. – Daniel Darabos Sep 17 '15 at 13:45
  • I should do some testing and improve this answer. I do recall some version-specific issues as well. Try Spark's Hadoop 1.x build. You can also try `s3a` instead of `s3n`. It ought to be the much improved replacement of `s3n` from Hadoop 2.6 on. I haven't tried it yet, but https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/ describes some hoops you have to jump through to get it working. – Daniel Darabos Nov 10 '15 at 10:37
3

The problem was actually a bug in Amazon's boto Python module: the MacPorts version is quite old. Installing boto through pip solved the problem, and ~/.aws/credentials was then correctly read.

Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have serious bugs that are very easy to run into. For the first problem, I would recommend first updating the aws command line interface, boto and Spark whenever something strange happens: this has "magically" solved a few issues for me already.

Eric O. Lebigot
  • I installed Python 3.6 and broke awscli. I guess I originally installed it with 2.7, so I then had to `pip install awscli` again in a Python 3 context. So the suggestion to always keep the aws CLI, boto and Spark updated is good advice! – Davos May 19 '17 at 04:50
3

Here is a solution for reading the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file, which can be parsed with Python's configparser.

import os
import configparser

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))

aws_profile = 'default' # your AWS profile to use

access_id = config.get(aws_profile, "aws_access_key_id") 
access_key = config.get(aws_profile, "aws_secret_access_key") 

See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
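
From there, one way to hand these values to Spark is through its Hadoop configuration. This is only a minimal sketch (not part of the gist), assuming an existing SparkContext `sc` and the s3n connector used elsewhere in this thread; the bucket and key are placeholders:

# Minimal sketch: feed the parsed keys into Spark's Hadoop configuration.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", access_key)
rdd = sc.textFile("s3n://my_bucket/my_key")  # placeholder bucket/key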

asmaier
1

Setting up environment variables could help.

In the Spark FAQ, under the question "How can I access data in S3?", they suggest setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Zeke Fast
0

I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems deprecated in favor of some "newAPIHadoopFile". The documentation on this is quite sketchy and I feel like you need to know Scala/Java to really get to the bottom of what everything means. In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings. The Spark API documentation says that it uses a dict that gets converted into a Java configuration (XML). I found the configuration for Java, which should reflect the values you should put into the dict: How to access S3/S3n from local hadoop installation

bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)

config_dict = {"fs.s3n.awsAccessKeyId":"FOOBAR",
               "fs.s3n.awsSecretAccessKey":"BARFOO"}

rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',  # key class: byte offset
                    'org.apache.hadoop.io.Text',          # value class: the line
                    conf=config_dict)

This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.
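
Since TextInputFormat yields (byte offset, line) pairs, here is a small follow-up (not from the original answer, just a sketch) for pulling the plain text lines out of the resulting RDD:

lines = rdd.values()   # drop the byte-offset keys, keep the line text
print(lines.take(5))   # peek at the first few lines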

firelynx
  • Doesn't work for me (Spark 1.5, Hadoop 2.4). I get the error "AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively)". And the S3 URL is no use there, since when you have a "/" in the key, it does not work (HADOOP-3733). – mathieu Sep 17 '15 at 13:30