I extended my answer a little bit. And please, follow the links, at the end of the article, they are pretty useful and have some pictures that help to understand the whole picture of spark memory management.
We should dig into spark memory management(mm) to figure out what is spark.execution.pyspark.memory.
So, first of all, there are two big parts of spark mm:
- Memory inside JVM;
- Memory outside JVM.
Memory inside JVM is divided into 4 parts:
- Storage memory - this memory is for spark cached data, broadcast variables, etc;
- Execution memory - this memory is for storing data required during execution spark tasks;
- User memory - this memory is for user purposes. You can store here your custom data structure, UDFs, UDAFs, etc;
- Reserved memory - this memory is for spark purposes and it hardcoded to 300MB as of spark 1.6.
Memory outside JVM is divided into 2 parts:
- OffHeap memory - this memory of things outside JVM, but for JVM purposes or this memory is used for Project Tungsten;
- External process memory - this memory is specific for SparkR or PythonR and used by processes that resided outside of JVM.
So, the parameter spark.executor.memory(or --executor-memory for spar-submit) responds how much memory will allocate inside JVM Heap per exectuor. This memory will split between: reserved memory, user memory, execution memory, storage memory. To control this splitting we need 2 more parameters: spark.memory.fraction and spark.memory.storageFraction
According to spark documentation:
spark.memory.fraction is responsible for fraction of heap used for execution and storage;
spark.memory.storageFraction is responsible for to amount of
storage memory immune to eviction, expressed as a fraction of the
size of the region set aside by spark.memory.fraction. So if
storage memory isn't used, execution memory may acquire all the
available memory and vice versa. This parameter controls how much
memory execution can evict if necessary.
More details here
Please look pictures of Heap memory parts here
Finally, Heap will be split in a next way:
- Reserved memory is hardcoded to 300MB
- User memory will calculate as (spark.executor.memory - reserved memory) * (1 - spark.memory.fraction)
- Spark memory(which consists of Storage memory and Execution memory) will calculate as (spark.executor.memory - reserved memory) * spark.memory.fraction. Then all this memory will split between Storage memory and Execution memory with spark.memory.storageFraction parameter.
The next parameter you asked about is spark.executor.pyspark.memory. It's a part of External process memory and it's responsible for how much memory python daemon will able to use. Python daemon is used, for example, for executing UDFs had written on python.
And the last one is spark.python.worker.memory. In this article I had found the next explanation: JVM process and Python process communicate to each other with py4J bridge that exposes objects between JVM and Python. So spark.python.worker.memory is controlling how much memory can be occupied by py4J for creating objects before spilling them to the disk.
You can read about mm more in the next articles: