
I just started learning Spark with Python and came across the following initialization of PySpark's SparkContext class:

from pyspark import SparkContext

sc = SparkContext(master="local[24]", pyFiles=['codes/spark_codes.py'])

I read the documentation; it says the `pyFiles` argument is used to send files to the cluster and add them to the PYTHONPATH. But since we will always be running the code from the master, what is the use of this particular argument?
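
For context, `pyFiles` only matters once executors run somewhere other than the driver machine. Below is a minimal sketch of that situation, assuming a standalone cluster at `spark://master-host:7077` and a hypothetical `codes/spark_codes.py` that defines a `clean()` function (both the URL and the module contents are illustrative, not part of the question):

from pyspark import SparkContext

# Connect to a real cluster instead of local mode; the master URL is hypothetical.
sc = SparkContext(
    master="spark://master-host:7077",
    appName="pyfiles-demo",
    pyFiles=["codes/spark_codes.py"],  # shipped to every executor and added to its PYTHONPATH
)

rdd = sc.parallelize(range(10), numSlices=4)

def apply_clean(x):
    # This import runs inside the task, on the executor, where the shipped
    # copy of spark_codes.py is available thanks to pyFiles.
    import spark_codes
    return spark_codes.clean(x)  # clean() is a hypothetical helper

print(rdd.map(apply_clean).collect())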

  • The code can be distributed to executors. It doesn't always run on the master – OneCricketeer Jul 28 '18 at 23:22
  • But once we import some code, using `from something import myclass`, and use it in the current context, how is that different from pyFiles? – rawwar Jul 28 '18 at 23:24
  • You must have a local module that is able to be imported. That's a different concept from actually getting those modules to other machines. Since you're not using an external Spark cluster scheduler, and are only running locally where all modules are available to you, there is no need for that argument – OneCricketeer Jul 28 '18 at 23:26
  • I do have multiple systems connected. I thought that once I import some code on the master node and then later execute methods like collect(), the code would be distributed to all the nodes. Am I wrong? – rawwar Jul 28 '18 at 23:27
  • `master=local` will only run on a single machine, not distribute anything except across 24 CPU cores, which is the number you've given – OneCricketeer Jul 28 '18 at 23:28
  • @cricket_007, thank you very much for pointing that out, but I am mostly interested in the latter part of my previous comment: "once I import some code on the master node and then later execute methods like collect(), the code will be distributed to all the nodes" – rawwar Jul 28 '18 at 23:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/176954/discussion-between-inaflash-and-cricket-007). – rawwar Jul 28 '18 at 23:40
  • Each executor node is responsible for collecting its assigned task data, which must then be serialized and returned to the driver node. Using collect is typically discouraged; rather, write the results back to HDFS (or another shared filesystem) or a database. The only thing I'd use collect for is maybe converting back into a local Pandas DataFrame – OneCricketeer Jul 29 '18 at 00:07
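
To illustrate the distinction being discussed: in local[N] mode everything runs on a single machine, so shipping code is moot, whereas on a real cluster the file has to reach the executors before a task can import it. A minimal sketch follows, again assuming the hypothetical `codes/spark_codes.py`; `addPyFile()` is the runtime counterpart of the `pyFiles` constructor argument:

from pyspark import SparkContext

sc = SparkContext(master="local[24]", appName="collect-demo")

# Runtime counterpart of pyFiles: in local mode the file is already on this machine,
# but this call is what makes it importable once executors live on other nodes.
sc.addPyFile("codes/spark_codes.py")

squares = sc.parallelize(range(1000)).map(lambda x: x * x)

# collect() serializes each partition's results and returns them to the driver;
# for large outputs prefer writing to shared storage instead, e.g.
# squares.saveAsTextFile("hdfs:///tmp/squares")  # hypothetical path
local_result = squares.collect()
print(local_result[:5])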

0 Answers