
As I've read in many blog posts and questions here on SO, for example this one (in the first few paragraphs), quoted as follows:

Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others), your transformation code (closure) is:

  1. serialized on the driver node,
  2. shipped to the appropriate nodes in the cluster,
  3. deserialized,
  4. and finally executed on the nodes

OK, here is my take on this:

I define some custom transformation/action functions in the driver, and those functions are then serialized and shipped to all the executors to run the job.
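To make this concrete, here is roughly the pattern I mean (the function is just a toy example of my own):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("closure-demo").getOrCreate()
sc = spark.sparkContext

# Defined in the driver process.
def add_one(x):
    return x + 1

# The closure around `add_one` is serialized on the driver, shipped to
# the executors, deserialized there, and run once per element.
nums = sc.parallelize([1, 2, 3])
print(nums.map(add_one).collect())  # [2, 3, 4]
```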

Then what's the point of shipping extra py-files to all the nodes? Since everything the executors need gets serialized to them anyway, what the heck is going on here?

  • @thesonyman101, excuse me, but I don't think you understood my question: why and when do we need to use `--py-files` to ship local files to the cluster? – avocado Apr 01 '17 at 05:12
  • What if you have several modules in your Python code? You can't just submit the main driver script and expect the executors to find the rest of the code that the driver tries to import. – OneCricketeer Apr 01 '17 at 20:16
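To illustrate the comment above with a hypothetical two-file layout (the file names are made up): a top-level function that lives in an imported module is pickled *by reference* (module name plus function name), so each executor must be able to `import helpers` on its own. `--py-files` is what puts the module where the worker nodes can import it.

```python
# helpers.py -- a hypothetical module the driver imports
def clean(line):
    return line.strip().lower()
```

```python
# main.py -- the driver script
from pyspark.sql import SparkSession
import helpers  # importable on the driver, but the executors need it too

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["  Foo ", " BAR "])
# Only a reference to helpers.clean travels with the closure, so each
# executor has to import the helpers module when deserializing it.
print(lines.map(helpers.clean).collect())  # ['foo', 'bar']
```

Submitting this with `spark-submit main.py` alone will typically fail on the executors with an import error for `helpers` (unless the file already happens to exist on every node); submitting with `spark-submit --py-files helpers.py main.py` ships the module alongside the job so the executors can import it.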

1 Answer


Not sure, but with Spark 2.x and the DataFrame API you can avoid closure serialization: the work is expressed as built-in expressions that run in Spark's own Scala/JVM engine on the nodes, so you don't have to deal with an extra Python process on each node.
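A sketch of that suggestion (assuming Spark 2.x; the data is made up). Note this only holds for built-in column expressions; a Python UDF would again be pickled and executed by Python workers on the nodes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# Built-in expressions compile into Spark's own execution plan,
# so no Python closure is pickled and shipped to the executors.
df.filter(F.col("id") > 1).withColumn("tag_upper", F.upper("tag")).show()
```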