It's probably the commit process between steps 3 and 4: the Hadoop MapReduce and Spark committers assume that rename is an O(1) atomic operation, and rely on it to commit work atomically. On S3, rename is O(data) and non-atomic when multiple files in a directory are involved. The 0% CPU load is the giveaway: the client is just awaiting a response from S3, which performs the COPY internally at 6-10 MB/s.
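To give a rough sense of what that copy rate means (the numbers here are illustrative, using the 6-10 MB/s figure above): "renaming" 30 GB of committed output at ~8 MB/s is roughly 30 * 1024 / 8 ≈ 3840 seconds, i.e. over an hour during which the job appears stalled while S3 copies the bytes server-side.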
There's work underway in HADOOP-13345 to do a zero-rename commit to S3. For now, you can look at the famed-but-fails-in-interesting-ways Direct Committer from Databricks.
One more thing: make sure you are using "algorithm 2" for committing, as algorithm 1 does a lot more renaming in the final job master commit. My full set of recommended settings for ORC/Parquet performance on Hadoop 2.7 (along with using s3a: URLs) is:
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
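If you'd rather set these in code than in spark-defaults.conf or on spark-submit, here is a minimal sketch of applying the same options programmatically; the app name and the bucket/paths in the comment are placeholders, not anything from your setup:

import org.apache.spark.sql.SparkSession

// Sketch: same settings as the list above, applied via the builder.
// Hadoop/S3A-level options are passed through with the "spark.hadoop." prefix.
val spark = SparkSession.builder()
  .appName("s3a-orc-parquet")  // placeholder name
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.orc.splits.include.file.footer", "true")
  .config("spark.sql.orc.cache.stripe.details.size", "10000")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .config("spark.speculation", "false")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
  .getOrCreate()

// Then read/write with s3a: URLs, e.g. (bucket is a placeholder):
// spark.read.parquet("s3a://my-bucket/input").write.parquet("s3a://my-bucket/output")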