
The documentation explains it as follows:

The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python's memory use, and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests. Note: This feature is dependent on Python's resource module; therefore, the behaviours and limitations are inherited. For instance, Windows does not support resource limiting, and actual resource is not limited on macOS.

There are two other configuration options: one controlling the amount of memory allocated to each executor - spark.executor.memory - and another controlling the amount of memory that each Python process within an executor can use before it starts to spill over to disk - spark.python.worker.memory.
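
For reference, a quick way to check what these two settings resolve to in a running PySpark session is sketched below (the fallback strings just quote the documented defaults; the actual values depend on how the job was submitted):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Fallbacks are the documented defaults; actual values depend on the deployment.
    print(spark.conf.get("spark.executor.memory", "not set (default: 1g)"))
    print(spark.conf.get("spark.python.worker.memory", "not set (default: 512m)"))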

Can someone please explain the behaviour and use of the spark.executor.pyspark.memory configuration, and in what ways it differs from spark.executor.memory and spark.python.worker.memory?

figs_and_nuts

1 Answer


I have extended my answer a little bit. Please also follow the links at the end of the answer; they are pretty useful and have some pictures that help to understand the whole picture of Spark memory management.


We should dig into Spark memory management (MM) to figure out what spark.executor.pyspark.memory is.

So, first of all, there are two big parts of Spark MM:

  • Memory inside JVM;
  • Memory outside JVM.

Memory inside the JVM is divided into 4 parts:

  • Storage memory - this memory is for Spark cached data, broadcast variables, etc.;
  • Execution memory - this memory is for storing data required during the execution of Spark tasks;
  • User memory - this memory is for user purposes. You can store your custom data structures, UDFs, UDAFs, etc. here;
  • Reserved memory - this memory is for Spark's internal purposes and is hardcoded to 300MB as of Spark 1.6.

Memory outside the JVM is divided into 2 parts:

  • Off-heap memory - this memory lives outside the JVM but is used for JVM purposes, for example by Project Tungsten;
  • External process memory - this memory is specific to SparkR or PySpark and is used by processes that reside outside the JVM.

So, the parameter spark.executor.memory (or --executor-memory for spark-submit) controls how much memory will be allocated inside the JVM heap per executor. This memory is split between reserved memory, user memory, execution memory and storage memory. To control this split we need 2 more parameters: spark.memory.fraction and spark.memory.storageFraction.
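
As a minimal sketch (the heap size is arbitrary and the two fractions are just Spark's documented defaults, not recommendations), these parameters are usually set together on the SparkConf:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .set("spark.executor.memory", "4g")          # JVM heap per executor (same as --executor-memory)
        .set("spark.memory.fraction", "0.6")         # share of (heap - 300MB) used for execution + storage
        .set("spark.memory.storageFraction", "0.5")  # part of that region protected from eviction (storage)
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()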

According to the Spark documentation:

spark.memory.fraction is responsible for the fraction of the heap used for execution and storage;

spark.memory.storageFraction is responsible for the amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction. So if storage memory isn't used, execution memory may acquire all the available memory and vice versa. This parameter controls how much memory execution can evict if necessary.

More details here

Please look at the pictures of the heap memory parts here


Finally, the heap will be split in the following way:

  • Reserved memory is hardcoded to 300MB;
  • User memory is calculated as (spark.executor.memory - reserved memory) * (1 - spark.memory.fraction);
  • Spark memory (which consists of storage memory and execution memory) is calculated as (spark.executor.memory - reserved memory) * spark.memory.fraction. All this memory is then split between storage memory and execution memory with the spark.memory.storageFraction parameter. A worked example of this split follows below.
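
To make the numbers concrete, here is a small worked example, assuming spark.executor.memory=4g and the documented defaults spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5:

    # Worked example of the heap split described above (all values in MB).
    executor_memory = 4 * 1024   # spark.executor.memory = 4g
    reserved_memory = 300        # hardcoded by Spark
    memory_fraction = 0.6        # spark.memory.fraction (default)
    storage_fraction = 0.5       # spark.memory.storageFraction (default)

    usable = executor_memory - reserved_memory            # 3796 MB
    user_memory = usable * (1 - memory_fraction)          # ~1518 MB
    spark_memory = usable * memory_fraction               # ~2278 MB
    storage_memory = spark_memory * storage_fraction      # ~1139 MB, protected from eviction
    execution_memory = spark_memory - storage_memory      # ~1139 MB

    print(user_memory, spark_memory, storage_memory, execution_memory)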

The next parameter you asked about is spark.executor.pyspark.memory. It is a part of the external process memory and it is responsible for how much memory the Python daemon will be able to use. The Python daemon is used, for example, for executing UDFs written in Python.
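
A minimal sketch of where this setting comes into play (the values are arbitrary, and the cap is only enforced on platforms supported by Python's resource module, as the documentation quoted in the question notes): the UDF below runs in the external Python worker whose memory is limited by spark.executor.pyspark.memory.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = (
        SparkSession.builder
        .config("spark.executor.memory", "4g")           # JVM heap per executor
        .config("spark.executor.pyspark.memory", "2g")   # cap for the Python workers of each executor
        .getOrCreate()
    )

    @udf(returnType=IntegerType())
    def plus_one(x):
        # Runs in the external Python process, outside the JVM heap.
        return x + 1

    spark.range(5).select(plus_one("id")).show()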


And the last one is spark.python.worker.memory. In this article I found the following explanation: the JVM process and the Python process communicate with each other through the Py4J bridge, which exposes objects between the JVM and Python. So spark.python.worker.memory controls how much memory can be occupied by Py4J for creating objects before spilling them to disk.
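
A minimal sketch, assuming a Python-side RDD aggregation, which is where this spill threshold applies (the 1g value is arbitrary; the documented default is 512m):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.python.worker.memory", "1g")   # memory used during aggregation before spilling to disk
        .getOrCreate()
    )

    # reduceByKey aggregates inside the Python worker; with large datasets the
    # worker starts spilling to disk once it exceeds the limit set above.
    rdd = spark.sparkContext.parallelize([(i % 3, i) for i in range(100)])
    print(rdd.reduceByKey(lambda a, b: a + b).collect())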

You can read more about Spark memory management in the following articles:

shay__
Artem Astashov