I'm running a sort job with Spark. The program reads from HDFS, sorts on composite keys, and then saves the partitioned result back to HDFS. The pseudocode looks like this:
input = sc.textFile
pairs = input.mapToPair
sorted = pairs.sortByKey
values = sorted.values
values.saveAsTextFile
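To make "composite keys" concrete, here is a minimal sketch of what such a key could look like in plain Java. The class name CompositeKey and its two fields (a String and a long) are hypothetical placeholders, not the actual key layout. As far as I understand, JavaPairRDD.sortByKey without an explicit comparator relies on the key's natural ordering, so the key is Comparable here, and Serializable so it can be shuffled between executors.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical composite key: the two fields and their types are assumptions.
final class CompositeKey implements Comparable<CompositeKey>, Serializable {
    private static final long serialVersionUID = 1L;

    final String primary;   // hypothetical first sort field
    final long secondary;   // hypothetical second sort field

    CompositeKey(String primary, long secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    // Order by the primary field first; use the secondary field to break ties.
    @Override
    public int compareTo(CompositeKey other) {
        int c = primary.compareTo(other.primary);
        return c != 0 ? c : Long.compare(secondary, other.secondary);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey k = (CompositeKey) o;
        return primary.equals(k.primary) && secondary == k.secondary;
    }

    @Override
    public int hashCode() {
        return 31 * primary.hashCode() + Long.hashCode(secondary);
    }

    @Override
    public String toString() {
        return primary + ":" + secondary;
    }

    // Local demonstration of the ordering only; no Spark is involved here.
    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(Arrays.asList(
                new CompositeKey("b", 1),
                new CompositeKey("a", 2),
                new CompositeKey("a", 1)));
        Collections.sort(keys);
        System.out.println(keys); // prints [a:1, a:2, b:1]
    }
}
```

In the pseudocode above, a key of this kind would be the first element of each pair produced by mapToPair.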
The input size is ~160 GB, and I specified 1000 partitions in both JavaSparkContext.textFile and JavaPairRDD.sortByKey. In the web UI, the job is split into two stages: saveAsTextFile and mapToPair. The mapToPair stage finished in 8 minutes. The saveAsTextFile stage took ~15 minutes to reach (2366/2373) progress, and the last few tasks ran indefinitely and never finished.
Cluster setup:
8 nodes
on each node: 15 GB memory, 8 cores
Running parameters:
--executor-memory 12G
--conf "spark.cores.max=60"
Thank you for any help.