I'm new to Spark, and read a lot of articles about Spark shuffle. Most of them mentioned Spark writes the shuffle files to local disk. What I don't understand is how subsequent remote worker nodes read these shuffle files.
https://stackoverflow.com/questions/58699907/spark-disk-i-o-on-stage-boundaries-explanation/58841524#58841524 – thebluephantom Sep 24 '22 at 09:43
1 Answer
This is the default shuffle behavior: a shuffle server (the executor itself, or the external shuffle service) is co-located on each worker and serves the shuffle files written to its local disk. Reducer tasks on other workers then fetch those shuffle blocks over TCP.

This is an important area to tune for Spark performance: disk and network I/O (including the Linux kernel TCP stack).

It is configurable; see https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior .

I suspect shuffle will be the next big area of improvement in Spark, with open-sourced external shuffle services (like Facebook's Cosco).
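As a sketch, here are a few of the documented shuffle-related settings from that page, in `spark-defaults.conf` form (the values shown are the defaults or illustrative, not tuning recommendations):

```properties
# Run the external shuffle service on each worker, so executors can be
# removed (e.g. with dynamic allocation) while their shuffle files
# remain servable to remote reducers.
spark.shuffle.service.enabled   true

# Compress map output files before they are written to local disk.
spark.shuffle.compress          true

# Max size of shuffle blocks fetched simultaneously by each reduce task
# over the network (trades memory for fewer round trips).
spark.reducer.maxSizeInFlight   48m

# Retries for fetches that fail due to transient network/IO issues.
spark.shuffle.io.maxRetries     3
```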

Thomas Decaux