S3Guard (HADOOP-13345) retrofits consistency to S3 by having DynamoDB store the listings. This makes it possible, for the first time, to reliably use S3A as a direct destination of work. Without it, execution time may seem like the problem, but the real one is that a rename-based committer may get an inconsistent listing and not even see which files it has to rename.
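For reference, turning S3Guard on is just S3A configuration. Here's a minimal sketch that points the metadata store at DynamoDB and lists a directory; the bucket name and path are made up, and the DynamoDB table itself is created and managed with the hadoop s3guard CLI, not by this code:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3GuardListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route S3A metadata through the DynamoDB metadata store, so listings
    // come back consistent.
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");

    // "example-bucket" and "/output" are made-up names.
    FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf);
    for (FileStatus st : fs.listStatus(new Path("/output"))) {
      System.out.println(st.getPath());
    }
  }
}
```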
The S3Guard committer work, HADOOP-13786, will, when finished (as of Aug 2017, still a work in progress), provide two committers.
Staging committer
- workers write to the local filesystem
- the task committer uploads the data to S3 but does not complete the operation; instead it saves the commit metainfo to HDFS
- that commit metainfo is committed as normal task/job data in HDFS
- in job commit, the committer reads the pending commit data from HDFS and completes those uploads, then cleans up any outstanding commits
Task commit is an upload of all the data; time is O(data/bandwidth).
This is based on Ryan Blue's s3committer from Netflix, and it is the one which is going to be safest to play with at first.
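Underneath, both committers lean on the same S3 feature: a multipart upload whose data has all been uploaded but which stays invisible until a final completion call. Here's a rough sketch of that pattern against the raw AWS Java SDK; the bucket and key are made up, and the real committers drive this through S3A rather than a bare client:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

public class DeferredCommitSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "example-bucket";            // made-up bucket
    String key = "output/part-000.orc.snappy";   // final destination key

    // Task side: start the upload and push the data up as parts.
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

    byte[] data = new byte[5 * 1024 * 1024];     // every part except the last must be >= 5MB
    List<PartETag> etags = new ArrayList<>();
    etags.add(s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key)
        .withUploadId(uploadId).withPartNumber(1)
        .withInputStream(new ByteArrayInputStream(data))
        .withPartSize(data.length)).getPartETag());

    // The task saves (bucket, key, uploadId, etags) as its pending commit and stops
    // here: nothing is visible at the destination yet.

    // Job commit (later, possibly in another process, after reloading that metadata):
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));

    // Job/task abort would instead call s3.abortMultipartUpload(...) with the same IDs.
  }
}
```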
Magic committer
So called because it does "magic" inside the filesystem.
- the filesystem itself recognises paths like s3a://dest/__magic/job044/task001/__base/part-000.orc.snappy
- redirects the write to the final destination, s3a://dest/part-000.orc.snappy, but doesn't complete the write in the stream close() call
- saves the commit metainfo to s3a, here s3a://dest/__magic/job044/task001/__base/part-000.orc.snappy.pending
- Task commit: load all the .pending files from that directory, aggregate them and save the result elsewhere. Time is O(files); the data size is unimportant.
- Task abort: load all the .pending files, abort those commits.
- Job commit: load all the pending files from the committed tasks and complete them.
Because it lists files in S3, it will need S3Guard to deliver consistency on AWS S3 (other S3 implementations are consistent out of the box, so don't need it).
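To make the path convention concrete, here's a hypothetical, heavily simplified version of the remapping: everything after the __base marker is treated as a path relative to the directory holding __magic. The real logic lives inside the S3A filesystem and covers far more cases; this is only an illustration.

```java
public class MagicPathSketch {

  // Made-up stand-in for the S3A "magic" path remapping; illustration only.
  static String finalDestination(String path) {
    int magic = path.indexOf("/__magic/");
    int base = path.indexOf("/__base/");
    if (magic < 0 || base < 0) {
      return path;                                  // not a magic path: left alone
    }
    String destDir = path.substring(0, magic);      // e.g. s3a://dest
    String relative = path.substring(base + "/__base/".length());
    return destDir + "/" + relative;
  }

  public static void main(String[] args) {
    System.out.println(finalDestination(
        "s3a://dest/__magic/job044/task001/__base/part-000.orc.snappy"));
    // prints s3a://dest/part-000.orc.snappy
  }
}
```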
Both committers share the same codebase; job commit for both will be O(files/threads), as the completions are all short POST requests which don't take up bandwidth or much time.
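To sketch why that figure holds: each pending commit needs a single short POST to complete, so the job committer can fan them out across a thread pool. The PendingCommit type and complete() call below are made-up stand-ins, not the committer's actual API:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelJobCommitSketch {

  /** Made-up stand-in for one saved (bucket, key, uploadId, etags) record. */
  interface PendingCommit {
    void complete();   // issues the single completeMultipartUpload POST
  }

  static void commitJob(List<PendingCommit> pending, int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (PendingCommit commit : pending) {
      pool.submit(commit::complete);   // each completion is cheap; total time ~ O(files/threads)
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
```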
In tests, the staging committer is faster than the magic one for small, test-scale files, because the magic committer talks more to S3, which is slow... though S3Guard does speed up the listing/getFileStatus calls. The more data you write, the longer task commit takes with the staging committer, whereas task commit time for the magic one is constant for the same number of files. Both are faster than using rename(), given how rename is mimicked with a list, copy and delete of every file.
GCS and Hadoop/Spark Commit algorithms
(I haven't looked at the GCS code here, so reserve the right to be wrong. Treat Dennis Huo's statements as authoritative.)
If GCS does rename() more efficiently than the S3A copy-then-delete, it should be faster: closer to O(files) than O(data), depending on the parallelisation in the code.
I don't know if they can go for a 0-rename committer. The changes in the mapreduce code under FileOutputFormat are designed to support different/pluggable committers for different filesystems, so they have the opportunity to do something here.
For now, make sure you are using the v2 MR commit algorithm, which, while less resilient to failures, does at least push the renames into task commit, rather than job commit.
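That's a single Hadoop property; a minimal sketch of setting it programmatically (Spark users can pass the same property with the spark.hadoop. prefix):

```java
import org.apache.hadoop.conf.Configuration;

public class CommitAlgorithmConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // v2 FileOutputCommitter algorithm: renames happen in task commit,
    // rather than being serialized into job commit.
    conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);
    System.out.println(conf.getInt("mapreduce.fileoutputcommitter.algorithm.version", 1));
  }
}
```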
See also Spark and Object Stores.