
I just started learning Spark with Python and came across the following initialization of PySpark's SparkContext class:

from pyspark import SparkContext

sc = SparkContext(master="local[24]", pyFiles=['codes/spark_codes.py'])

I read the documentation; it says the `pyFiles` argument is used to send files to the cluster and add them to the PYTHONPATH. But since we will always be running the code from the master, what is the use of this particular argument?
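
For context, `pyFiles` only matters once executors run somewhere other than the driver machine. Below is a minimal sketch of that situation, assuming a standalone cluster at `spark://master-host:7077` and a hypothetical `codes/spark_codes.py` that defines a `clean()` function (both the URL and the module contents are illustrative, not part of the question):

from pyspark import SparkContext

# Connect to a real cluster instead of local mode; the master URL is hypothetical.
sc = SparkContext(
    master="spark://master-host:7077",
    appName="pyfiles-demo",
    pyFiles=["codes/spark_codes.py"],  # shipped to every executor and added to its PYTHONPATH
)

rdd = sc.parallelize(range(10), numSlices=4)

def apply_clean(x):
    # This import runs inside the task, on the executor, where the shipped
    # copy of spark_codes.py is available thanks to pyFiles.
    import spark_codes
    return spark_codes.clean(x)  # clean() is a hypothetical helper

print(rdd.map(apply_clean).collect())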

  • The code can be distributed to executors. It doesn't always run on the master – OneCricketeer Jul 28 '18 at 23:22
  • But once we import some code, using `from something import myclass`, and use it in the current context, how is that different from pyFiles? – rawwar Jul 28 '18 at 23:24
  • You must have a local module that is able to be imported. That's a different concept from actually getting those modules to other machines. Since you're not using an external Spark cluster scheduler, and are only running locally where all modules are available to you, there is no need for that argument – OneCricketeer Jul 28 '18 at 23:26
  • I do have multiple systems connected. I thought that once I import some code on the master node and then later execute methods like collect(), the code would be distributed to all the nodes. Am I wrong? – rawwar Jul 28 '18 at 23:27
  • `master=local` will only run on a single machine, not distribute anything except across 24 CPU cores, which is the number you've given – OneCricketeer Jul 28 '18 at 23:28
  • @cricket_007, thank you very much for pointing that out, but I am mostly interested in the latter part of my previous comment: "once I import some code on the master node and then later execute methods like collect(), the code will be distributed to all the nodes" – rawwar Jul 28 '18 at 23:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/176954/discussion-between-inaflash-and-cricket-007). – rawwar Jul 28 '18 at 23:40
  • Each executor node is responsible for collecting its assigned task data, which must then be serialized and returned to the driver node. Using collect is typically discouraged; rather, write the results back to HDFS (or another shared filesystem) or a database. The only thing I'd use collect for is maybe converting back into a local Pandas DataFrame – OneCricketeer Jul 29 '18 at 00:07
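
To illustrate the distinction being discussed: in local[N] mode everything runs on a single machine, so shipping code is moot, whereas on a real cluster the file has to reach the executors before a task can import it. A minimal sketch follows, again assuming the hypothetical `codes/spark_codes.py`; `addPyFile()` is the runtime counterpart of the `pyFiles` constructor argument:

from pyspark import SparkContext

sc = SparkContext(master="local[24]", appName="collect-demo")

# Runtime counterpart of pyFiles: in local mode the file is already on this machine,
# but this call is what makes it importable once executors live on other nodes.
sc.addPyFile("codes/spark_codes.py")

squares = sc.parallelize(range(1000)).map(lambda x: x * x)

# collect() serializes each partition's results and returns them to the driver;
# for large outputs prefer writing to shared storage instead, e.g.
# squares.saveAsTextFile("hdfs:///tmp/squares")  # hypothetical path
local_result = squares.collect()
print(local_result[:5])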

0 Answers