We have a moderately large PySpark program that we run on a Mesos cluster.

We run the program with `spark.executor.cores=8` and `spark.cores.max=24`. Each Mesos node has 12 vCPUs, so only one executor is started per node.
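
For reference, this is roughly how the job is configured (a minimal sketch, not our actual launcher; the application name and Mesos master URL below are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch of the configuration only; "our-app" and the master URL are placeholders.
spark = (
    SparkSession.builder
    .appName("our-app")
    .master("mesos://<mesos-master>:5050")   # placeholder Mesos master
    .config("spark.executor.cores", "8")     # 8 cores per executor
    .config("spark.cores.max", "24")         # 24 cores in total -> 3 executors
    .getOrCreate()
)
```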

The program runs flawlessly, with correct results.

However, the issue is that each executor consumes much more CPU than the 8 cores it is allotted: the load frequently reaches 25 or more. With htop we see that 8 Python processes are started, as expected, but each of them spawns several threads, so a single Python process can reach 300% CPU.

(screenshot: htop view)

This behavior is annoying in a shared cluster deployment.

Can someone explain this behavior? What are these 3 additional threads that PySpark starts?
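
For what it's worth, here is a diagnostic one could run to list the Python-level threads inside a worker (a sketch only; note that `threading.enumerate()` only sees threads started through Python's threading module, so native threads created by compiled libraries would not show up):

```python
import threading
from pyspark import SparkContext

def worker_threads(_):
    # Names of the Python-level threads inside the worker process executing this task
    return sorted(t.name for t in threading.enumerate())

sc = SparkContext.getOrCreate()
print(sc.parallelize([0], numSlices=1).map(worker_threads).collect())
```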

Additional info:

  • The functions we use in our Spark operations are not multithreaded
  • We have the same behavior in local mode, outside of Mesos (see the repro sketch after this list)
  • We use Spark 2.1.1 and Python 3.5
  • Nothing else runs on the Mesos nodes, except the usual base services
  • In our test platform, Mesos nodes are actually OpenStack VM
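
Repro sketch for the local-mode case mentioned above (hypothetical workload, not our real program; the map function is deliberately pure Python and single-threaded):

```python
from pyspark.sql import SparkSession

def cpu_bound(x):
    # Pure-Python, single-threaded work: no NumPy/BLAS, no explicit threading
    total = 0
    for i in range(100000):
        total += (x * i) % 7
    return total

spark = (
    SparkSession.builder
    .master("local[8]")              # mimic the 8 cores given to each executor
    .appName("thread-count-repro")   # placeholder name
    .getOrCreate()
)

spark.sparkContext.parallelize(range(1000), 64).map(cpu_bound).count()
spark.stop()
```

While this runs, htop (with userland threads shown) can be used to compare the number of threads per Python worker with the CPU usage reported for it.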
Matthieu Rouget
  • Are you 101% sure that your program doesn't spawn these new threads? [Example](https://stackoverflow.com/questions/30214474/how-to-run-multiple-jobs-in-one-sparkcontext-from-separate-threads-in-pyspark) that *may* be relevant. – gsamaras Sep 06 '17 at 08:16
  • Interesting that your java processes show 0 usage. For us, part of the excess usage is the java processes, even on lines like `df = spark.read.parquet(path); grpd = df.rdd.map(lambda x: (x[0], list(x[1:]))).groupByKey()` where you'd assume that for sure no multi-threading is used. For this case we can contain the problem a bit with `spark.task.cpus=2` to account for the JVM overhead. The problem gets worse for more Python-heavy stages, e.g. the following `.mapValues` - here it's not just JVM overhead but also more processes/executors than tasks, as in your example. – fanfabbb Sep 25 '17 at 11:54
  • I had a similar problem, but in my case the high CPU load was only there until the input file was read (I use `read_json`). During reading, the Java container (that spawned the Python process) had CPU consumption above 100%, and the Python process was also above 100%. We only have 1 vcore per executor. After the file had been read, normal processing would resume with commensurate loads. – RedBaron Aug 03 '18 at 09:51
