Problems with Hadoop distcp from HDFS to Amazon S3

Question

I am trying to move data from HDFS to S3 using distcp. The distcp job seems to succeed, but on S3 the files are not being created correctly. There are two issues:

The file names and paths are not replicated. All files end up as block_<some number> at the root of the bucket.
It creates bunch of extra files on S3 with some meta data and logs.

I could not find any documentation/examples for this. What am I missing? How can I debug?

Here are some more details:

$ hadoop version 
Hadoop 0.20.2-cdh3u0
Subversion  -r 
Compiled by diego on Sun May  1 15:42:11 PDT 2011
From source with checksum 
hadoop fs –ls hdfs://hadoopmaster/data/paramesh/
…<bunch of files>…

hadoop distcp  hdfs://hadoopmaster/data/paramesh/ s3://<id>:<key>@paramesh-test/
$ ./s3cmd-1.1.0-beta3/s3cmd ls s3://paramesh-test

                       DIR   s3://paramesh-test//
                       DIR   s3://paramesh-test/test/
2012-05-10 02:20         0   s3://paramesh-test/block_-1067032400066050484
2012-05-10 02:20      8953   s3://paramesh-test/block_-183772151151054731
2012-05-10 02:20     11209   s3://paramesh-test/block_-2049242382445148749
2012-05-10 01:40      1916   s3://paramesh-test/block_-5404926129840434651
2012-05-10 01:40      8953   s3://paramesh-test/block_-6515202635859543492
2012-05-10 02:20     48051   s3://paramesh-test/block_1132982570595970987
2012-05-10 01:40     48052   s3://paramesh-test/block_3632190765594848890
2012-05-10 02:20      1160   s3://paramesh-test/block_363439138801598558
2012-05-10 01:40      1160   s3://paramesh-test/block_3786390805575657892
2012-05-10 01:40     11876   s3://paramesh-test/block_4393980661686993969

score 15 · Answer 1 · answered May 10 '12 at 23:19

15

You should use s3n instead of s3.

s3n is the native file system implementation (ie - regular files), using s3 imposes hdfs block structure on the files so you can't really read them without going through hdfs libraries.

Thus:

hadoop distcp hdfs://file/1 s3n://bucket/destination

answered May 10 '12 at 23:19

Matthew Rathbone

8,144
7
49
79

Note that if using AWS EMR, s3n and s3 apparently point to same path (only if using AWS's EMR -- I know author didn't mention, but figured others might get confused). source: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html – Dolan Antenucci Jun 24 '12 at 20:46
1

Typically the hdfs is an absolute path, starting with a slash: hdfs:///file/1 – ramn Oct 09 '12 at 10:42
And what if the file was bigger than 5GB? Since s3n is limited to 5GB as native file system. S3// is for files bigger than 5GB even though it won't let you use it with other applications. Am I right or there is a way for example make an external table out of the file bigger than 5GB on S3? Mine works if it's s3n and it's less than 5GB, otherwise it gives me weird outcome. – Maziyar Nov 23 '13 at 06:25
as stated here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html you can set -Dfs.s3n.multipart.uploads.enabled=true to upload files larger than 5GB, or you can use the third generation s3a filesystem – Sean Feb 22 '16 at 20:22

score 3 · Answer 2 · answered May 25 '12 at 06:34

Amazon has created a version of distcp that is optimized for transferring between hdfs and s3 which they call, appropriately, s3distcp. You may want to check that out as well. It is intended for use with Amazon EMR, but the jar is available in s3, so you might be able to use it outside of an EMR job flow.

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

Sean · Answer 3 · 2016-02-22T20:42:01.893

In the event that your files in HDFS are larger than 5GB, you will encounter errors in your distcp job that look like:

Caused by: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>23472570134</ProposedSize><MaxSizeAllowed>5368709120</MaxSizeAllowed><RequestId>5BDA6B12B9E99CE9</RequestId><HostId>vmWvS3Ynp35bpIi7IjB7mv1waJSBu5gfrqF9U2JzUYsXg0L7/sy42liEO4m0+lh8V6CqU7PU2uo=</HostId></Error> at org.jets3t.service.S3Service.putObject(S3Service.java:2267) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:122) ... 27 more Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143

To fix this, use either the s3n filesystem as @matthew-rathbone suggested, but with -Dfs.s3n.multipart.uploads.enabled=true like:

hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true hdfs://file/1 s3n://bucket/destination

OR

Use the "next generation" s3 filesystem, s3a like:

hadoop distcp -Dfs.s3a.endpoint=apigateway.us-east-1.amazonaws.com hdfs://file/1 s3a://bucket/destination

Options and documentation for these live here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

For cloudera users, 5.2 does not support the `s3a` protocol yet, so option `s3n` with the multipart option is needed. Version 5.5 and later support newer `s3a`. — Tom Harrison, Aug 02 '16 at 14:47

score 2 · Answer 4 · answered Sep 02 '16 at 12:50

Updating this for Apache Hadoop 2.7+, and ignoring Amazon EMR as they've changed things there.

If you are using Hadoop 2.7 or later, use s3a over s3n. This also applies to recent versions of HDP and, AFAIK, CDH.
This supports 5+GB files, has other nice features, etc. It is tangibly better when reading files —and will only get better over time.
Apache s3:// should be considered deprecated -you don't need it any more, and shouldn't be using it.
Amazon EMR use "s3://" to refer to their own, custom, binding to S3. That's what you should be using if you are running on EMR.

Improving distcp reliability and performance working with object stores is still and ongoing piece of work...contributions are, as always, welcome.

Problems with Hadoop distcp from HDFS to Amazon S3

4 Answers4

Linked