
I'm basically running a sort using Spark. The Spark program reads from HDFS, sorts on composite keys, and then saves the partitioned result back to HDFS. The pseudocode (paths, the key class, and the pair-building function are simplified placeholders) looks like this:

JavaRDD<String> input = sc.textFile(inputPath, 1000);                      // inputPath is a placeholder; 1000 partitions
JavaPairRDD<CompositeKey, String> pairs = input.mapToPair(toCompositeKey); // build (composite key, line) pairs
JavaPairRDD<CompositeKey, String> sorted = pairs.sortByKey(true, 1000);    // ascending sort into 1000 partitions
JavaRDD<String> values = sorted.values();
values.saveAsTextFile(outputPath);                                         // outputPath is a placeholder

The input size is ~160 GB, and I specified 1000 partitions in both JavaSparkContext.textFile and JavaPairRDD.sortByKey. In the WebUI, the job is split into two stages: saveAsTextFile and mapToPair. mapToPair finished in 8 minutes, while saveAsTextFile took ~15 minutes to reach (2366/2373) progress, and the last few tasks just run forever and never finish.
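
If it helps with diagnosis, this is roughly how I could check whether a few of the sorted partitions are much larger than the rest (just a sketch, assuming `sorted` is the JavaPairRDD from the pseudocode above; glom() materializes each partition as a list, which should be fine here since partitions average ~160 MB):

import java.util.List;

List<Integer> partitionSizes = sorted.glom()          // one List of records per partition
    .map(partition -> partition.size())               // number of records in each partition
    .collect();                                       // 1000 counts back on the driver

If the last few counts were far larger than the others, that would suggest the composite keys are skewed and the straggling tasks are the ones handling the heavy partitions.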

Cluster setup:

  • 8 nodes
  • 15 GB memory and 8 cores per node

Running parameters:

  • --executor-memory 12G
  • --conf "spark.cores.max=60"

Thank you for any help.

  • Just shooting from the hip here: do you have replication turned on in HDFS? Can you look into whether HDFS starts replicating your partitions? – Tobber Mar 09 '15 at 18:02
  • You could switch to Kryo serialization, which is faster than standard Java serialization. – aaronman Mar 09 '15 at 18:21
  • To save space in the HDFS cluster, I currently set the replication factor to 1, which means no replication. – Chandler Lee Mar 09 '15 at 21:38
  • Can you look into this question: http://stackoverflow.com/questions/32342214/spark-1-4-1-saveastextfile-to-s3-is-very-slow-on-emr-4-0-0 – Ravindra babu Mar 12 '16 at 21:29
  • 2
    Hi. Did you ever solve this? I have the exact same behavior. Task 1275/1276 is hanging there forever. The thread dump shows a bunch of locks and WAITING processes. Did you ever figure this out? Thanks in advance for any hints. – Jose Fonseca May 21 '17 at 22:29
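
A minimal sketch of the Kryo suggestion from the comments, assuming the composite key is a custom class (CompositeKey here is a placeholder for whatever key class the job actually uses):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // enable Kryo
    .registerKryoClasses(new Class<?>[]{CompositeKey.class});                // register the key class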

0 Answers