We have a moderately large PySpark program that we run on a Mesos cluster.
We run the program with `spark.executor.cores=8` and `spark.cores.max=24`. Each Mesos node has 12 vcpus, so only one executor is started on each node.
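For reference, here is a minimal sketch of how we set up the session with these values; the application name and the rest of the configuration are placeholders:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch of the relevant configuration (app name is a placeholder).
conf = (SparkConf()
        .setAppName("our-pyspark-job")
        .set("spark.executor.cores", "8")   # cores requested per executor
        .set("spark.cores.max", "24"))      # total cores across the cluster

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```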
The program runs flawlessly, with correct results.
However, the issue is that each executor consumes far more CPU than its 8 allotted cores: the CPU load on a node frequently reaches 25 or more.
With `htop`, we see that 8 Python processes are started, as expected. However, each Python process spawns several threads, so each process can reach up to 300% CPU.
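To confirm the thread counts from inside the job rather than via `htop`, we can run something like the following rough, Linux-only sketch; `rdd` stands for any of our existing RDDs:

```python
import os

def report_threads(partition):
    # Report the OS-level thread count of this Python worker,
    # as exposed by the Linux /proc interface.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                yield (os.getpid(), int(line.split()[1]))
                return

# Hypothetical usage on one of our RDDs:
# thread_counts = rdd.mapPartitions(report_threads).collect()
```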
This behavior is annoying in a shared cluster deployment.
Can someone explain this behavior? What are these 3 additional threads that PySpark starts?
Additional information:
- The functions we use in our Spark operations are not multithreaded
- We see the same behavior in local mode, outside of Mesos (see the local-mode sketch after this list)
- We use Spark 2.1.1 and Python 3.5
- Nothing else runs on the Mesos nodes, except the usual base services
- On our test platform, the Mesos nodes are actually OpenStack VMs
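For completeness, this is roughly the local-mode session we use when reproducing the problem outside of Mesos (Spark 2.1.1, Python 3.5); the app name and core count are placeholders:

```python
from pyspark.sql import SparkSession

# Local-mode session used to reproduce the behavior outside of Mesos.
spark = (SparkSession.builder
         .master("local[8]")              # 8 local cores, mirroring executor.cores
         .appName("thread-count-repro")   # placeholder name
         .getOrCreate())
```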