
I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy' (distcp). However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:

hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true

However, that behaves as if the '-D' arguments aren't there at all:

ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

I've looked at the hadoop distcp documentation, but can't find an explanation there for why this isn't working. I've also tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just a gentle nudge to do it some other way?

How is one supposed to pass S3 credentials with distcp? Does anyone know?

  • You shouldn't use spaces after `-D`, but you also should not pass those via the command line anyway. Why aren't those in your core-site.xml or defined as environment variables? – OneCricketeer Nov 23 '17 at 13:33
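
For reference, the environment-variable route the comment mentions would look roughly like the sketch below; the EnvironmentVariableCredentialsProvider named in the error message is what reads these variables, assuming they are visible wherever the S3A client runs.

export AWS_ACCESS_KEY_ID=[keyid]
export AWS_SECRET_ACCESS_KEY=[secretkey]
hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path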

3 Answers


It appears the format of the credential flags has changed since the previous version. The following command works:

hadoop distcp \
  -Dfs.s3a.access.key=[accesskey] \
  -Dfs.s3a.secret.key=[secretkey] \
  -Dfs.s3a.fast.upload=true \
  -update \
  s3a://[bucket]/[folder]/[filename] hdfs:///some/path
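
As a quick sanity check, the same properties should also let a plain listing work before you kick off a large copy (placeholders as above):

hadoop fs \
  -Dfs.s3a.access.key=[accesskey] \
  -Dfs.s3a.secret.key=[secretkey] \
  -ls s3a://[bucket]/[folder]/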
  • What do you mean the format? `-D` is a standard Java flag, and spaces are taken as separate arguments – OneCricketeer Nov 23 '17 at 14:12
  • 1
    Never mind the -D flag, I got that from some bad documentation site and it was obviously a red herring. The format of the old flags was Dfs.s3n.awsAccessKeyId and Dfs.s3n.awsSecretAccessKey. Apparently now it's Dfs.s3a.access.key and Dfs.s3a.secret.key – KDC Nov 24 '17 at 09:52

In case someone lands here with the same error while using -D hadoop.security.credential.provider.path: make sure your credential store (the .jceks file) is located on the distributed file system (HDFS), because distcp runs from one of the NodeManager nodes and every node needs to be able to access the store.
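
For example, the store can be created directly on HDFS with the hadoop credential command and then referenced from distcp; this is a sketch, and the jceks path is just a placeholder:

hadoop credential create fs.s3a.access.key -value [accesskey] \
  -provider jceks://hdfs/user/[user]/s3.jceks
hadoop credential create fs.s3a.secret.key -value [secretkey] \
  -provider jceks://hdfs/user/[user]/s3.jceks

hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/[user]/s3.jceks \
  -update \
  s3a://[bucket]/[folder]/[filename] hdfs:///some/path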


Koen's answer helped me; here is my version, going the other way (HDFS to S3) with temporary session credentials:

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key=[accesskey] \
  -Dfs.s3a.secret.key=[secretkey] \
  -Dfs.s3a.session.token=[sessiontoken] \
  -Dfs.s3a.fast.upload=true \
  hdfs:///some/path s3a://[bucket]/[folder]/[filename] 
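
The session token here is a temporary credential from STS; roughly, you can obtain all three values with the AWS CLI (the duration below is just an example) and map the AccessKeyId, SecretAccessKey and SessionToken from the response onto fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token:

aws sts get-session-token --duration-seconds 3600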