Scenario:
Kafka -> Spark Streaming
Logic in each Spark Streaming microbatch (30 seconds):
Read JSON -> Parse JSON -> Send to Kafka
My streaming job reads from around 1,000 Kafka topics with around 10K Kafka partitions; total throughput is around 5 million events/s.
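For reference, the per-batch logic is roughly the following (a simplified sketch: broker addresses, topic names, and the output topic are placeholders, and the real JSON parsing is elided):

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object JsonRelay {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-json-relay")
    val ssc  = new StreamingContext(conf, Seconds(30))           // 30-second microbatch

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",                    // placeholder brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "json-relay",
      "auto.offset.reset"  -> "latest"
    )

    // ~1000 input topics, ~10K Kafka partitions -> ~10K Spark tasks per batch (1:1 mapping)
    val topics = (1 to 1000).map(i => s"input-topic-$i")          // placeholder topic names

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one producer per Spark partition (i.e. per Kafka partition)
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")            // placeholder brokers
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)

        records.foreach { rec =>
          val parsed = rec.value()                                // real job parses/reshapes the JSON here
          producer.send(new ProducerRecord[String, String]("output-topic", parsed))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```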
The issue comes from uneven traffic load across Kafka partitions: some partitions have roughly 50 times the throughput of the smaller ones. Because KafkaUtils creates a 1:1 mapping from Kafka partitions to Spark partitions, this produces skewed RDD partitions and really hurts overall performance: in each microbatch, most executors sit idle waiting for the one with the largest load to finish. I can see this in the Spark UI; at some point in each microbatch only a few executors have "ACTIVE" tasks while all the others have finished theirs and are waiting, and in the task time distribution MAX is 2.5 minutes while MEDIAN is only 20 seconds.
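This is the small diagnostic I used to confirm the per-partition numbers (run separately from the production path, since it consumes the batch just to count records per partition):

```scala
// Diagnostic only: count records per Spark partition (= per Kafka partition)
// in one microbatch to quantify the skew reported by the Spark UI.
stream.foreachRDD { rdd =>
  val counts = rdd
    .mapPartitionsWithIndex { case (idx, it) => Iterator((idx, it.size)) }
    .collect()
    .sortBy(-_._2)
  println(s"Heaviest partition: ${counts.head}, lightest: ${counts.last}")
}
```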
Notes:
- This is Spark Streaming (DStreams), not Structured Streaming
- I am aware of the post Spark - repartition() vs coalesce(); I'm not asking about the difference between repartition() and coalesce(). The load is consistent, so this is not about autoscaling or dynamic allocation either.
What I tried:
- coalesce() helps a little but does not remove the skew and sometimes even makes it worse; it also comes with a higher risk of OOM on executors.
- repartition() does remove the skew, but a full shuffle is simply too expensive at this scale; the shuffle penalty is not paid back by shorter execution time for each batch. Increasing the batch interval does not help either, because a longer interval means more data per microbatch, so the amount of work to shuffle grows with it. (See the sketch after this list for where I applied both calls.)
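For context, this is roughly where both calls went inside the batch (same simplified pipeline as above; the target partition counts are placeholders):

```scala
stream.foreachRDD { rdd =>
  // Pull out the payloads first (ConsumerRecord itself is not serializable).
  val values = rdd.map(_.value())

  // Attempt 1: coalesce (no shuffle). Merges several Kafka partitions into one Spark
  // partition; a hot partition stays hot, and the merged partition can OOM an executor.
  val coalesced = values.coalesce(2000)          // target count is a placeholder

  // Attempt 2: repartition (full shuffle). Evens out partition sizes, but shuffling
  // ~5M events/s worth of data every 30-second batch costs more than it saves.
  val repartitioned = values.repartition(10000)  // target count is a placeholder

  // ... parse JSON and produce to Kafka as before ...
}
```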
How can I distribute the workload more evenly across Spark executors so that resources are used more efficiently and overall performance improves?