It's probably the commit process between steps 3 and 4: the Hadoop MapReduce and Spark committers assume that rename is an O(1) atomic operation, and rely on it to commit work atomically. On S3, rename is O(data) and non-atomic when multiple files in a directory are involved. The 0% CPU load is the giveaway: the client is just awaiting a response from S3, which performs the COPY internally at 6-10 MB/s.
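To give a rough sense of what that copy rate means (the numbers here are illustrative, using the 6-10 MB/s figure above): "renaming" 30 GB of committed output at ~8 MB/s is roughly 30 * 1024 / 8 ≈ 3840 seconds, i.e. over an hour during which the job appears stalled while S3 copies the bytes server-side.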
There's work underway in HADOOP-13345 to do a zero-rename commit to S3. For now, you can look at the famed-but-fails-in-interesting-ways Direct Committer from Databricks.
One more thing: make sure you are using "algorithm 2" for committing, as algorithm 1 does a lot more renaming in the final job master commit. My full set of recommended settings for ORC/Parquet performance on Hadoop 2.7 (along with using s3a: URLs) is:
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
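If you'd rather set these in code than in spark-defaults.conf or on spark-submit, here is a minimal sketch of applying the same options programmatically; the app name and the bucket/paths in the comment are placeholders, not anything from your setup:

import org.apache.spark.sql.SparkSession

// Sketch: same settings as the list above, applied via the builder.
// Hadoop/S3A-level options are passed through with the "spark.hadoop." prefix.
val spark = SparkSession.builder()
  .appName("s3a-orc-parquet")  // placeholder name
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.orc.splits.include.file.footer", "true")
  .config("spark.sql.orc.cache.stripe.details.size", "10000")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .config("spark.speculation", "false")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
  .getOrCreate()

// Then read/write with s3a: URLs, e.g. (bucket is a placeholder):
// spark.read.parquet("s3a://my-bucket/input").write.parquet("s3a://my-bucket/output")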