I have a very simple word-count-like program that generates (Long, Double) counts like that:
val lines = sc.textFile(directory)
lines.repartition(600).mapPartitions{lineIterator =>
// Generate iterator of (Long,Double) counts
}
.reduceByKey(new HashPartitioner(30), (v1, v2) => v1 + v2).saveAsTextFile(outDir, classOf[GzipCodec])
My problem: The last of the 30 partitions never gets written.
Here are a few details:
- My input is 5 GB gz-compressed and I expect about 1B unique Long keys.
- I run on a 32 core 1.5TB machine. Input and output come from a local disk with 2TB free. Spark is assigned to use all the ram and happily does so. This application occupies about 0.5 TB.
I can observe the following:
- For 29 partitions the reduce and repartition (because of the HashPartitioner) takes about 2h. The last one does not finish, not even after a day. Two to four threads stay on 100%.
- No error or warning appears in the log
- Spark occupies about 100GB in /tmp which aligns with what the UI reports for shuffle write.
- In the UI I can see the number of "shuffle read records" growing very, very slowly for the remaining task. After one day, still one magnitude away from what all the finished tasks show.
The last log looks like that:
15/08/03 23:26:43 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000020_748: Committed
15/08/03 23:26:43 INFO Executor: Finished task 20.0 in stage 2.0 (TID 748). 865 bytes result sent to driver
15/08/03 23:27:50 INFO FileOutputCommitter: Saved output of task 'attempt_201508031748_0002_m_000009_737' to file:/output-dir/_temporary/0/task_201508031748_0002_m_000009
15/08/03 23:27:50 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000009_737: Committed
15/08/03 23:27:50 INFO Executor: Finished task 9.0 in stage 2.0 (TID 737). 865 bytes result sent to driver
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 3
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3_piece0 of size 2009 dropped from memory (free 611091153849)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3 of size 3336 dropped from memory (free 611091157185)
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 4
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4_piece0 of size 2295 dropped from memory (free 611091159480)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4 of size 4016 dropped from memory (free 611091163496)
Imagine the first five lines repeated for 28 other partitions within a two minute time frame.
I have tried several things:
- Spark 1.3.0 and 1.4.0
- nio instead of netty
- flatMap instead of mapPartitions
- Just 30 instead of 600 input partitions
Still, I never get the last 1/30 of my data out of spark. Did anyone ever observe something similar? These two posts here and here seem to describe similar problems but no solution.
UPDATE
The task that never finishes is always the first task of the reduceKey+writeToTextFile. I have also removed the HashPartitioner and even tried on a bigger cluster with 400 cores and 6000 partitions. Only 5999 finish successfully, the last runs forever.
The UI shows for all tasks something like Shuffle Read Size / Records: 20.0 MB / 1954832 but for the first it shows (at the moment) Shuffle Read Size / Records: 150.1 MB / 711836
Numbers still growing....