
When I run a Spark job written with pyspark, a JVM is launched with an -Xmx1g setting that I can't seem to change. Here is the ps aux output:

 /usr/lib/jvm/jre/bin/java -cp /home/ec2-user/miniconda3/lib/python3.6/site-packages/pyspark/conf:/home/****/miniconda3/lib/python3.6/site-packages/pyspark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit pyspark-shell

My question is: how do I set this property? I can set the master's memory using SPARK_DAEMON_MEMORY and SPARK_DRIVER_MEMORY, but that doesn't affect the process pyspark spawns.

I already tried JAVA_OPTS, and I also looked through the package's /bin files, but I couldn't work out where this is set.

Setting spark.driver.memory and spark.executor.memory in the job context itself didn't help either.
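For reference, the in-job attempt looked roughly like this (a minimal sketch; the app name and memory values are illustrative, not the exact job code):

    from pyspark import SparkConf, SparkContext

    # Illustrative sketch only -- roughly what "setting it in the job context"
    # means above. The driver JVM shown in the ps output is launched when the
    # SparkContext is created, and these values did not change its -Xmx in our setup.
    conf = (SparkConf()
            .setAppName("my-job")                  # hypothetical app name
            .set("spark.driver.memory", "4g")
            .set("spark.executor.memory", "4g"))
    sc = SparkContext(conf=conf)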

Edit:

After moving to submitting jobs with spark-submit (the code and infrastructure evolved from a standalone configuration), everything was resolved. Submitting programmatically (using SparkConf) seems to override some of the cluster's setup.
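For anyone hitting the same thing, here is a minimal sketch of the working setup (the file name, flags and memory values are illustrative): the job no longer sets memory in code, and the sizing goes on the spark-submit command line, so the launcher can fix the driver JVM's -Xmx before it starts.

    # my_job.py -- minimal sketch; submitted with something like:
    #   spark-submit --driver-memory 4g --executor-memory 4g my_job.py
    # The heap is then set by the launcher before the driver JVM starts,
    # instead of the job trying to change it from inside a running process.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("my-job")   # no spark.*.memory set in code
    sc = SparkContext(conf=conf)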

Reut Sharabani
  • Could you explain what problem you are trying to solve? PySpark will respect `spark.*.memory` properties just like any other Spark application, but the JVM heap is mostly irrelevant for PySpark programs; the `spark.*.memoryOverhead` options are typically more relevant. And `SPARK_DAEMON_MEMORY` is a cluster-manager property, not an application property, so it's not really related to `spark.*.memory` (not to mention there is no Python code nearby). – zero323 Apr 30 '18 at 15:46
  • And if you want to limit the Python heap size - [How to limit the heap size?](https://stackoverflow.com/q/2308091) – zero323 Apr 30 '18 at 15:52
  • I'm not sure I understand. This isn't the master (cluster-manager) process, nor is it the Python process. It is an extra Java process (running `pyspark-shell` via `SparkSubmit`) spawned when submitting a job using `pyspark`. It adds jars from the package itself to the classpath and runs a JVM to submit the job. I want to change its heap size. Is it possible? – Reut Sharabani Apr 30 '18 at 16:08
  • There is no separate JVM for PySpark other than the JVM driver, and it respects `spark.driver.memory`. This, however, is not the Python driver memory. And what I pointed out above is that `SPARK_DAEMON_MEMORY` is Standalone cluster configuration and has nothing to do with the app (sorry if it was confusing). – zero323 Apr 30 '18 at 16:22
  • @user6910411 thanks for all this information. I'm very fresh to spark/pyspark. We did go from a standalone configuration to a cluster, but I still see that `SPARK_DAEMON_MEMORY` affects the `jvm` spawned by `start-master.sh`. However, when I submit a job programmatically it spawns another JVM (which I've attached as the `ps aux` output). `spark.driver.memory` does nothing to change that JVM's heap size (neither as a system property submitted as part of the context nor in `spark-env.sh`). Sorry if our setup is not standard but it was an incremental process :) – Reut Sharabani Apr 30 '18 at 16:31
  • I think (correct me if I am wrong) the confusion here is what `start-master.sh` does. The master is part of the resource manager (that's where `SPARK_DAEMON_MEMORY` is used); it is not related to the driver or the application as such. It is a completely different, not even obligatory, component. `spark.driver.memory` is a driver property. It is specifically connected to the JVM started by `SparkSubmit` - that's where you use `spark.driver.memory`. And finally there is the Python driver, which is a plain Python process used to instrument the JVM driver when using PySpark. No JVM there, so no JVM heap settings. Does it make sense? – zero323 Apr 30 '18 at 16:42
  • Reading what you've written and going over the docs again, I suspect that somehow `spark.driver.memory` is being ignored. I will update if I have findings. Otherwise we'll just set the whole thing up from scratch to make sure it's not something we set and forgot. – Reut Sharabani May 01 '18 at 04:57
  • @user6910411 ok, so the problem was that I naively used a SparkContext from within Python to submit the job, and just ran the code as plain Python (not submitting it to Spark). After changing that, everything is resolved. Thank you for your help! – Reut Sharabani May 01 '18 at 07:01
  • Does this answer your question? [Total allocation exceeds 95.00% (960,285,889 bytes) of heap memory- pyspark error](https://stackoverflow.com/questions/53407442/total-allocation-exceeds-95-00-960-285-889-bytes-of-heap-memory-pyspark-erro) – Doron Yaacoby May 26 '20 at 11:45

1 Answer


You can pass --conf spark.driver.extraJavaOptions and --conf spark.executor.extraJavaOptions to spark-submit, for example:

SPARK_LOCATION/spark-submit --verbose --master yarn-cluster --num-executors 15 --conf spark.driver.cores=3 ....... --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M " --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M " .....
user3689574