25

I am passing input and output folders as parameters to a MapReduce word count program from a web page.

I am getting the error below:

HTTP Status 500 - Request processing failed; nested exception is java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

Ronak Jain
user3795951

7 Answers

39

The documentation (http://wiki.apache.org/hadoop/AmazonS3) gives the format:

 s3n://ID:SECRET@BUCKET/Path
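
For the word count program in the question, that means embedding the keys directly in the input URI passed to the job. A minimal sketch, assuming a hypothetical wordcount.jar, driver class, and bucket name:

hadoop jar wordcount.jar WordCount \
  s3n://YOUR_ACCESS_KEY_ID:YOUR_SECRET_ACCESS_KEY@your-bucket/input \
  hdfs:///wordcount/output

As the comments below note, this breaks if the secret key contains a "/".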
RickH
  • Unfortunately this does not work if the secret happens to have a "/" in it, which is quite frequent. It's an old known bug (https://issues.apache.org/jira/browse/HADOOP-3733) and may be fixed in Hadoop 2.8 for the s3a protocol (https://issues.apache.org/jira/browse/HADOOP-11573). The alternative is to put the keys in the conf (but this has other caveats too). – mathieu Sep 17 '15 at 13:20
  • It worked for emr-4.3.0. emr-4.4.0 and emr-4.5.0 throw `java.lang.IllegalArgumentException: Bucket name must not be formatted as an IP Address`, as if the ID and the SECRET were part of the bucket name. emr-4.6.0 throws `java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long`. Any ideas? – Sergey Orshanskiy Jun 04 '16 at 00:03
  • s3n is not supported anymore. – TriCore Oct 29 '17 at 03:17
10

I suggest you use this:

hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=<your_access_id> \
  -Dfs.s3n.awsSecretAccessKey=<your_access_key> \
  s3n://origin hdfs://destination

It also works as a workaround for secret keys that contain slashes, since the key is never embedded in a URL. The -D parameters with the ID and secret key must be supplied exactly in this position: after distcp and before the origin path.
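
The same generic -D options can be passed to an ordinary MapReduce job such as the word count in the question, assuming its driver goes through ToolRunner/GenericOptionsParser; a sketch with hypothetical jar, class, and bucket names:

hadoop jar wordcount.jar WordCount \
  -Dfs.s3n.awsAccessKeyId=<your_access_id> \
  -Dfs.s3n.awsSecretAccessKey=<your_access_key> \
  s3n://your-bucket/input hdfs:///wordcount/output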

8

Passing the AWS credentials as part of the Amazon s3n URL is normally not recommended, security-wise, especially if that code is pushed to a repository-hosting service (like GitHub). Ideally, set your credentials in conf/core-site.xml as:

<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>XXXXXX</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>XXXXXX</value>
  </property>
</configuration>

or install and configure the AWS CLI on your machine:

pip install awscli
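
Once the keys are in core-site.xml, the job can reference the bucket without credentials in the URL; a minimal sketch with hypothetical jar, class, and bucket names:

hadoop jar wordcount.jar WordCount s3n://your-bucket/input hdfs:///wordcount/output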
dyltini
2

For PySpark beginners:

Prepare

Download the hadoop-aws jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws (pick the version that matches your Hadoop build) and put it into Spark's jars folder.

Then you can either:

1. Hadoop config file

Export the credentials as environment variables:

export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>

and register the S3 filesystem implementations in core-site.xml:

<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>

  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
  </property>
</configuration>

2. PySpark config

sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")

Example

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: S3 sample
    """
    access_key = '<access-key>'
    secret_key = '<secret-key>'

    spark = SparkSession\
        .builder\
        .appName("Demo")\
        .getOrCreate()

    sc = spark.sparkContext

    # remove this block if you use core-site.xml and the environment variables
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")

    # fetch from s3, returns RDD
    csv_rdd = spark.sparkContext.textFile("s3n://<bucket-name>/path/to/file.csv")
    c = csv_rdd.count()
    print("~~~~~~~~~~~~~~~~~~~~~count~~~~~~~~~~~~~~~~~~~~~")
    print(c)

    spark.stop()
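
Since s3n is no longer supported in recent Hadoop releases (see the comment under the accepted answer), the same read can go through the s3a connector and the DataFrame API instead. A minimal sketch, assuming the same placeholder credentials and bucket, and the hadoop-aws jar from the Prepare step:

from pyspark.sql import SparkSession

access_key = '<access-key>'  # placeholder
secret_key = '<secret-key>'  # placeholder

spark = SparkSession.builder.appName("DemoS3a").getOrCreate()

# configure the s3a connector with the credentials
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf.set("fs.s3a.access.key", access_key)
hconf.set("fs.s3a.secret.key", secret_key)

# read the same CSV through s3a into a DataFrame and count the rows
df = spark.read.csv("s3a://<bucket-name>/path/to/file.csv")
print(df.count())

spark.stop()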
Mithril
0

Create a core-site.xml file and put it on the classpath. In the file, specify:

<configuration>
    <property>
        <name>fs.s3.awsAccessKeyId</name>
        <value>your aws access key id</value>
        <description>
            aws s3 key id
        </description>
    </property>

    <property>
        <name>fs.s3.awsSecretAccessKey</name>
        <value>your aws access key</value>
        <description>
            aws s3 key
        </description>
    </property>
</configuration>

Hadoop by default loads two configuration resources, in order, from the classpath:

  • core-default.xml: read-only defaults for Hadoop
  • core-site.xml: site-specific configuration for a given Hadoop installation
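
A minimal sketch of getting the file onto the classpath via the standard Hadoop configuration directory (paths are assumptions for a typical installation):

# copy the file into the Hadoop configuration directory, which is on the classpath
cp core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml

# or point Hadoop at a custom configuration directory
export HADOOP_CONF_DIR=/path/to/conf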
0

Change s3 to s3n in the S3 URI.
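
For example, with a hypothetical bucket, only the scheme changes, so the NativeS3FileSystem (and its fs.s3n.* credential properties) is used:

s3://my-bucket/input  ->  s3n://my-bucket/input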

0
With the newer s3a connector, the keys can be passed on the distcp command line in the same way:

hadoop distcp \
  -Dfs.s3a.access.key=<....> \
  -Dfs.s3a.secret.key=<....> \
  -Dfs.s3a.fast.upload=true \
  -update \
  s3a://<path to file>/ hdfs:///path/