1

I am running a PySpark job in an IPython console. I set the Spark master to local[4], so I expect one core for the driver, which should be a Java process, and each of the other three cores to run a Python process. However, this is a screenshot of my top output:

[screenshot of top output showing the running Python processes]

Why are there 16 Python processes instead of only 3? If I remember correctly, 16 is the total number of cores on this server.
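
For context, the setup being described is roughly the following (a minimal sketch; the exact way `local[4]` was set is an assumption, since the question only mentions the master string and the IPython console):

```python
# Minimal sketch of the assumed setup: a local SparkContext with 4 task threads.
# In the pyspark/IPython shell a SparkContext may already exist as `sc`,
# in which case only the --master local[4] option matters.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("local4-demo")
sc = SparkContext(conf=conf)

# A trivial job with Python-side work, so PySpark has to start Python worker processes.
print(sc.parallelize(range(100), 4).map(lambda x: x * x).sum())
```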

panc
  • why should python care about the java process? – YOU Jul 18 '16 at 01:13
  • @YOU Based on my understanding of PySpark, when a core in a Spark executor needs to run a task it will launch a Python process to do the actual computation. Since I specified `local[4]`, one core will be used for the driver, and each of the other three cores will be launching a python process to compute for tasks. – panc Jul 18 '16 at 02:23
  • relevant: https://www.youtube.com/watch?v=7ooZ4S7Ay6Y&feature=youtu.be&t=1h30m30s – Kristian Jul 18 '16 at 17:10
  • @Kristian Thank you. This video is very helpful. So when I set the master as `local[N]`, Spark will start one executor (only one JVM) with `N` potential threads to run tasks. In my example, I have 4 potential threads to run tasks. But the video also mentions that Spark has some "internal threads", so "there is no way to match the number of threads with the number of cores". But that doesn't answer why there are so many Python processes. I guess I need to find out how PySpark works (one way to check is sketched below). – panc Jul 18 '16 at 21:18
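
A quick way to see how tasks map onto Python processes in local mode is to have each task report its own PID (a sketch, assuming `sc` is the `local[4]` context described in the question):

```python
import os

# 4 partitions -> at most 4 tasks run concurrently on a local[4] context.
pids = (
    sc.parallelize(range(1000), 4)
      .map(lambda _: os.getpid())  # each task reports the PID of the Python worker running it
      .distinct()
      .collect()
)
print(len(pids), "distinct Python worker process(es):", pids)
```

Note that top will usually show more Python processes than one job uses at once: besides the workers that run tasks, PySpark keeps a pyspark.daemon process that forks them, and with `spark.python.worker.reuse` (on by default) idle workers are kept alive between jobs, which is one plausible reason the process count climbs well above the number of task threads.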

1 Answer

0

Take a look here if you haven't done so.

You have decided to use four workers, each with one executor by default. However, an executor runs several tasks, and each task is a Python process.

An excellent explanation of the topic is given here.
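
As a concrete illustration of the knobs involved (a sketch, not taken from the linked explanation; the setting names are standard Spark configuration, the values are only examples):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setMaster("local[4]")                     # one local JVM, up to 4 concurrent task threads
    .setAppName("python-worker-demo")
    .set("spark.python.worker.reuse", "true")  # reuse Python workers between tasks instead of forking new ones
)
sc = SparkContext(conf=conf)

# Fewer partitions -> fewer concurrent tasks -> fewer Python workers needed at any one time.
rdd = sc.parallelize(range(10000), 4)
print(rdd.map(lambda x: x % 7).distinct().count())
```

In other words, the number of Python processes tracks how many tasks are (or have been) running, not the `local[N]` core count directly.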

shuaiyuancn
  • The reference seems to be related only to cluster mode. I am using local mode on a single computer, so there is only one executor with multiple threads. My understanding is that each thread launches a Python process to run a task. Maybe my understanding is wrong. – panc Jul 18 '16 at 21:20
  • @PanChao I think you got confused by the "work threads" mentioned [here](http://spark.apache.org/docs/latest/submitting-applications.html#master-urls). A worker = an Executor by default, so it will actually spawn multiple Python threads, each of which deals with a Task. – shuaiyuancn Aug 08 '16 at 10:37