After reading through the documentation, I still do not understand how Spark running on YARN accounts for Python memory consumption. Does it count towards spark.executor.memory, towards spark.executor.memoryOverhead, or somewhere else?
In particular, I have a PySpark application with spark.executor.memory=25G and spark.executor.cores=4, and I encounter frequent "Container killed by YARN for exceeding memory limits." errors when running a map on an RDD. The map operates on a fairly large number of complex Python objects, so it is expected to use a non-trivial amount of memory, but nowhere near 25GB. How should I configure the different memory settings for memory-heavy Python code?
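For reference, a minimal sketch of how the job is set up (the app name, input path, and process_record function are placeholders, not the real code):

from pyspark.sql import SparkSession

# Placeholder sketch of the current configuration.
spark = (
    SparkSession.builder
    .appName("heavy-python-job")              # placeholder name
    .config("spark.executor.memory", "25g")   # JVM heap per executor
    .config("spark.executor.cores", "4")      # 4 concurrent tasks (Python workers) per executor
    .getOrCreate()
)

def process_record(line):
    # Stand-in for the real logic: builds a non-trivial Python object per record.
    return {"tokens": line.split(), "length": len(line)}

rdd = spark.sparkContext.textFile("hdfs:///path/to/input")  # placeholder path
rdd.map(process_record).count()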