I'm new to Spark and have read a lot of articles about Spark shuffle. Most of them mention that Spark writes the shuffle files to local disk. What I don't understand is how the remote worker nodes running the subsequent stage read these shuffle files.

thebluephantom
Alex

1 Answer

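This is the default shuffle behavior: the shuffle server is co-located with each worker and serves the shuffle files that worker wrote to its local disk. The workers running the next stage ask the driver where the map outputs they need are located, then download those blocks from the other workers over TCP.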
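To make the mechanism concrete, here is a toy sketch in plain Python of the three steps involved: the map side writes one file per reduce partition to local disk, a small TCP server on the same machine serves those files, and the reduce side downloads the block it needs. This is not Spark's code. Spark uses a Netty-based block transfer service and a different on-disk layout, and every name here (`write_map_output`, `BlockHandler`, `start_block_server`, `fetch_block`, the `shuffle_<map>_<partition>.data` file names) is made up for illustration.

```python
import os
import socket
import socketserver
import tempfile
import threading


def write_map_output(shuffle_dir, map_id, records, num_partitions):
    """Map side: partition records by key, write one local file per reduce partition."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        # Toy partitioner: integer key modulo the number of partitions.
        buckets[key % num_partitions].append(f"{key},{value}")
    for p, lines in buckets.items():
        # Hypothetical file naming, not Spark's real layout.
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{p}.data")
        with open(path, "w") as f:
            f.write("\n".join(lines))


class BlockHandler(socketserver.StreamRequestHandler):
    """Serving side: answer "send me block X" requests from files on local disk."""

    def handle(self):
        block_id = self.rfile.readline().decode().strip()
        path = os.path.join(self.server.shuffle_dir, block_id)
        with open(path, "rb") as f:
            self.wfile.write(f.read())


def start_block_server(shuffle_dir):
    """Start a TCP server on a free local port that serves files from shuffle_dir."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), BlockHandler)
    server.shuffle_dir = shuffle_dir
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_block(address, block_id):
    """Reduce side: connect to the worker holding the block and download it."""
    with socket.create_connection(address) as conn:
        conn.sendall((block_id + "\n").encode())
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection: block fully received
                break
            chunks.append(chunk)
    text = b"".join(chunks).decode()
    return [tuple(map(int, line.split(","))) for line in text.splitlines()]


if __name__ == "__main__":
    shuffle_dir = tempfile.mkdtemp()
    write_map_output(shuffle_dir, 0, [(1, 10), (2, 20), (3, 30), (4, 40)], 2)
    server = start_block_server(shuffle_dir)
    # The "reducer" for partition 1 pulls its block over TCP.
    print(fetch_block(server.server_address, "shuffle_0_1.data"))  # [(1, 10), (3, 30)]
    server.shutdown()
    server.server_close()
```

In real Spark the reducer does not know the address up front; it gets the block locations from the driver first, and fetches from many workers in parallel.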

Shuffle is one of the most important things to tune in order to improve Spark performance, because it stresses both the disk and the network (including the Linux kernel TCP stack).

The behavior is configurable through the shuffle settings described at https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior.
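As a rough sketch of what that tuning can look like, the snippet below collects a few properties from that page and turns them into `spark-submit` arguments. The property names come from the Spark configuration docs; the values are only illustrative assumptions, not recommendations, and `to_submit_args` is a hypothetical helper written for this example.

```python
# Property names are from the "Shuffle Behavior" section of the Spark docs.
# The values are illustrative assumptions, not tuning advice.
shuffle_conf = {
    "spark.shuffle.compress": "true",         # compress map output files
    "spark.shuffle.file.buffer": "64k",       # in-memory buffer per shuffle file stream
    "spark.reducer.maxSizeInFlight": "96m",   # map output fetched at once per reduce task
    "spark.shuffle.io.maxRetries": "5",       # retries for failed fetches
    "spark.shuffle.io.retryWait": "10s",      # wait between fetch retries
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # my_job.py is a placeholder application name.
    print(" ".join(["spark-submit"] + to_submit_args(shuffle_conf) + ["my_job.py"]))
```

The same key/value pairs could instead be set programmatically, for example with `SparkConf.setAll` in PySpark, or placed in `spark-defaults.conf`.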

I have a feeling that shuffle will be the next big area of improvement in Spark, with open-sourced external shuffle services (along the lines of Facebook's Cosco).

Thomas Decaux