I have read the following explanation in many blogs and posts here on SO, for example this one (in its first few paragraphs):
Not to get into too many details, but when you run different transformations on a RDD (map, flatMap, filter and others), your transformation code (closure) is:
- serialized on the driver node,
- shipped to the appropriate nodes in the cluster,
- deserialized,
- and finally executed on the nodes
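To check that I understand these four steps, here is a minimal single-process sketch of them. It is only an analogy: it uses the standard library's `marshal` to serialize a function's bytecode by value, whereas PySpark, as far as I know, uses its own pickler (cloudpickle) and a real network transfer. The function `add_one` and the "driver"/"executor" labels are hypothetical stand-ins, not Spark APIs.

```python
import marshal
import types


def add_one(x):
    # Hypothetical transformation, standing in for a function passed to rdd.map
    return x + 1


# 1. "Driver": serialize the function's code object into bytes
payload = marshal.dumps(add_one.__code__)

# 2. "Shipping": in a real cluster these bytes would travel over the network;
#    here they are simply copied
shipped = bytes(payload)

# 3. "Executor": deserialize the bytes back into a code object and
#    rebuild a callable from it (empty globals, since add_one needs none)
restored_code = marshal.loads(shipped)
restored = types.FunctionType(restored_code, {})

# 4. "Executor": run the rebuilt function on its share of the data
result = [restored(x) for x in [1, 2, 3]]
print(result)
```

Note that the rebuilt function here is self-contained: it does not import or call anything outside its own body.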
OK, here is my take on this:
I define some custom transformation/action functions in the driver, and those functions are then serialized and sent to all the executors to run the job.
So what is the point of shipping extra py-files to all the nodes? If everything the executors need is already serialized and sent to them, why is that ever necessary?