3

When I run a simple distcp command:

hadoop distcp s3://src-bucket/src-dir s3://dest-bucket/dest-dir 

I get a slight discrepancy in the total size (in bytes) between src-dir and dest-dir:

>aws s3 ls --recursive --summarize s3://src-bucket/src-dir/
...
Total Objects: 12290
   Total Size: 64911104881181

>aws s3 ls --recursive --summarize s3://dest-bucket/dest-dir/
...
Total Objects: 12290
   Total Size: 64901040284124

My questions are:

  1. What could have introduced this discrepancy? Is the content of my dest dir still the same as the original?
  2. Most importantly, are there parameters I can set to ensure each file looks exactly the same as its src counterpart (i.e. same file size)?
pl0u
  • 365
  • 7
  • 16

2 Answers

0
  1. What could have introduced this discrepancy? Is the content of my dest dir still the same as the original?

Is it possible that there was concurrent write activity in src-dir while the DistCp was running? For example, was a file in src-dir open for write by some other application, and was that application still writing content to the file while the DistCp ran?

Eventual consistency effects in S3 can also come into play, particularly around updates of existing objects. If an application overwrites an existing object, then there is a window of time afterward during which applications reading that object might see either the old version or the new version. More details on this are available in the AWS documentation of the Amazon S3 Data Consistency Model.
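If you want to rule those effects out, one way is to spot-check an individual object on both sides and compare the sizes that S3 reports. This is just a sketch: the bucket names match your example, but the key some/part-00000 is a placeholder you would replace with a real key from your listing.

aws s3api head-object --bucket src-bucket --key src-dir/some/part-00000
aws s3api head-object --bucket dest-bucket --key dest-dir/some/part-00000

If the ContentLength values differ for the same key, you have narrowed the discrepancy down to specific files rather than the aggregate totals.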

  2. Most importantly, are there parameters I can set to ensure each file looks exactly the same as its src counterpart (i.e. same file size)?

In general, DistCp will perform a CRC check of each source file against the new copy at the destination to confirm that it was copied correctly. I noticed you are using the S3 file system instead of HDFS, though. For S3, as with many of the alternative file systems, there is a limitation: this CRC verification cannot be performed.
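One thing you can still do is take advantage of the fact that DistCp falls back to comparing file lengths when checksums are not comparable: re-running the copy with -update should re-copy any file whose size differs at the destination. A minimal sketch using the paths from the question (check the DistCp documentation for your Hadoop version to confirm the exact -update semantics):

hadoop distcp -update s3://src-bucket/src-dir s3://dest-bucket/dest-dir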

As an added note, the S3FileSystem (URIs with s3:// for the scheme) is effectively deprecated, unmaintained by the Apache Hadoop community and poorly supported. If possible, we recommend that users migrate to S3AFileSystem (URIs with s3a:// for the scheme) for improved features, performance and support. There are more details in the Integration with Amazon Web Services documentation.

If you cannot find an explanation for the behavior you are seeing with s3://, then it is possible there is a bug lurking there, and you might be better served trying s3a://. (If you have existing data that was already written using s3:// though, then you'd need to figure out some kind of migration for that data first, such as by copying from an s3:// URI to an equivalent s3a:// URI.)
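A minimal sketch of that kind of migration copy, assuming the hadoop-aws module and its dependencies are on the classpath and the s3a credentials (fs.s3a.access.key and fs.s3a.secret.key) are already configured, e.g. in core-site.xml; old-bucket/old-dir and new-bucket/new-dir are placeholder names for your own locations:

hadoop distcp s3://old-bucket/old-dir s3a://new-bucket/new-dir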

Chris Nauroth
  • 9,614
  • 1
  • 35
  • 39
  • 1
    Chris, pl0u was using the AWS s3 tools, so s3:// is all they have to play with. They'll need to move to the Hadoop libs to play with distcp and our code – stevel Jun 20 '17 at 14:45
  • Good catch. Thanks, @SteveLoughran. – Chris Nauroth Jun 20 '17 at 16:17
  • Hey @ChrisNauroth, thank you for your insight. 1. We don't have concurrent writing. 2. I looked at the counters of the distcp mapreduce job: S3: Number of bytes read=2600370 S3: Number of bytes written=2600794 It looks like S3: Number of bytes read is not equal to the writes counter. These numbers reflect the reported file sizes when I do an aws ls. Could this mean that the mapreduce job is actually writing out a different amount of data than it has read in? – pl0u Jun 22 '17 at 10:19
0

My take is there's a difference in how src is compressed and how dst is compressed (or not). So I'd say:

1) check the .*compress.* settings for whatever creates src

2) make sure they match the .*compress.* settings of the distcp job

Compression algorithms, given the same settings, should produce deterministic output. So I suspect a mismatch between compression at the origin and compression (or lack thereof) at the destination.
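If you want to verify this, a quick (and admittedly crude) way is to grep the client configuration for compression-related properties on the host that produced src and on the host running the DistCp; /etc/hadoop/conf is just an assumed location, so adjust for your distribution:

grep -i compress /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/mapred-site.xml

Properties like mapreduce.output.fileoutputformat.compress and mapreduce.output.fileoutputformat.compress.codec are the usual ones to compare; a mismatch there between the two jobs would be consistent with the size difference.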

rgs
  • 61
  • 1
  • 5