I'm new to Spark, and read a lot of articles about Spark shuffle. Most of them mentioned Spark writes the shuffle files to local disk. What I don't understand is how subsequent remote worker nodes read these shuffle files.
https://stackoverflow.com/questions/58699907/spark-disk-i-o-on-stage-boundaries-explanation/58841524#58841524 – thebluephantom Sep 24 '22 at 09:43
1 Answer
This is the default shuffle behavior: a shuffle server (the executor itself, or the external shuffle service) is co-located on each worker and serves the shuffle files written to its local disk. Reducer tasks on other workers then fetch those shuffle blocks over TCP.

This is an important area to tune for Spark performance: disk and network I/O (including the Linux kernel TCP stack).

It is configurable; see https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior .

I suspect shuffle will be the next big area of improvement in Spark, with open-sourced external shuffle services (like Facebook's Cosco).
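As a sketch, here are a few of the documented shuffle-related settings from that page, in `spark-defaults.conf` form (the values shown are the defaults or illustrative, not tuning recommendations):

```properties
# Run the external shuffle service on each worker, so executors can be
# removed (e.g. with dynamic allocation) while their shuffle files
# remain servable to remote reducers.
spark.shuffle.service.enabled   true

# Compress map output files before they are written to local disk.
spark.shuffle.compress          true

# Max size of shuffle blocks fetched simultaneously by each reduce task
# over the network (trades memory for fewer round trips).
spark.reducer.maxSizeInFlight   48m

# Retries for fetches that fail due to transient network/IO issues.
spark.shuffle.io.maxRetries     3
```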

Thomas Decaux