
As I've read in many blog posts and questions here on SO, for example this one (in the first few paragraphs), quoted as follows:

Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:

  1. serialized on the driver node,
  2. shipped to the appropriate nodes in the cluster,
  3. deserialized,
  4. and finally executed on the nodes

OK, here is my take on this:

I define some custom transformation/action functions in the driver, and those functions are then serialized and shipped to all the executors to run the job.
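To make this concrete, here is roughly the pattern I mean (the function is just a toy example of my own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("closure-demo").getOrCreate()
sc = spark.sparkContext

# Defined in the driver process.
def add_one(x):
    return x + 1

# The closure around `add_one` is serialized on the driver, shipped to
# the executors, deserialized there, and run once per element.
nums = sc.parallelize([1, 2, 3])
print(nums.map(add_one).collect())  # [2, 3, 4]
```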

Then what's the point of shipping extra py-files to all the nodes? Since everything the executors need gets serialized to them anyway, what the heck is going on here?

  • @thesonyman101, excuse me, but I don't think you understood my question: why and when do we need to use `--py-files` to ship local files to the cluster? – avocado Apr 01 '17 at 05:12
  • What if you have several modules in your Python code? You can't just submit the main driver script and expect the executors to find the rest of the code that the driver tries to import. – OneCricketeer Apr 01 '17 at 20:16
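To illustrate the comment above with a hypothetical two-file layout (the file names are made up): a top-level function that lives in an imported module is pickled *by reference* (module name plus function name), so each executor must be able to `import helpers` on its own. `--py-files` is what puts the module where the worker nodes can import it.

```python
# helpers.py -- a hypothetical module the driver imports
def clean(line):
    return line.strip().lower()
```

```python
# main.py -- the driver script
from pyspark.sql import SparkSession
import helpers  # importable on the driver, but the executors need it too

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["  Foo ", " BAR "])
# Only a reference to helpers.clean travels with the closure, so each
# executor has to import the helpers module when deserializing it.
print(lines.map(helpers.clean).collect())  # ['foo', 'bar']
```

Submitting this with `spark-submit main.py` alone will typically fail on the executors with an import error for `helpers` (unless the file already happens to exist on every node); submitting with `spark-submit --py-files helpers.py main.py` ships the module alongside the job so the executors can import it.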

1 Answer


Not sure, but with Spark 2.x and the DataFrame API you can avoid closure serialization: the work is expressed as built-in expressions that run in Spark's own Scala/JVM engine on the nodes, so you don't have to deal with an extra Python process on each node.
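A sketch of that suggestion (assuming Spark 2.x; the data is made up). Note this only holds for built-in column expressions; a Python UDF would again be pickled and executed by Python workers on the nodes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# Built-in expressions compile into Spark's own execution plan,
# so no Python closure is pickled and shipped to the executors.
df.filter(F.col("id") > 1).withColumn("tag_upper", F.upper("tag")).show()
```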