
My Spark job fails with the following error:

java.lang.IllegalArgumentException: Required executor memory (33792 MB), offHeap memory (0) MB, overhead (8192 MB), and PySpark memory (0 MB) 
is above the max threshold (24576 MB) of this cluster! 
Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

I have set executor memory to 33g and executor memory overhead to 8g. However, the total should be less than or equal to 24g as per the error log. Can someone help me understand what exactly 24g refers to? Is it the RAM on the master node or something else? Why is it capped at 24g? Once I figure that out, I can programmatically calculate the other values so I don't run into this issue again.

Setup: A make command containing multiple spark-submit commands is run from Jenkins, which launches the jobs on an AWS EMR cluster running Spark 3.x.
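
For reference, here is how the numbers in the error add up (a minimal illustration; the MB figures are taken straight from the message above):

```python
# Illustration only: all MB figures are copied from the error message above.
executor_memory_mb = 33 * 1024  # spark.executor.memory = 33g -> 33792 MB
overhead_mb = 8 * 1024          # spark.executor.memoryOverhead = 8g -> 8192 MB
offheap_mb = 0                  # offHeap memory
pyspark_mb = 0                  # PySpark memory

requested_mb = executor_memory_mb + overhead_mb + offheap_mb + pyspark_mb
max_threshold_mb = 24576        # the "max threshold" reported for this cluster

print(requested_mb, ">", max_threshold_mb)  # 41984 > 24576 -> request is rejected
```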

Shubham Meshram

1 Answer


This error happens because you're requesting more resources than are available on the cluster (see the org.apache.spark.deploy.yarn.Client source). In your case specifically (AWS EMR), I think you should check the value of yarn.nodemanager.resource.memory-mb as the message says (in yarn-site.xml or via the NodeManager Web UI), and not try to allocate more than this value of memory per YARN container.
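
If you want to check this before submitting, a minimal sketch could look like the following (the /etc/hadoop/conf/yarn-site.xml path is the usual EMR location and the hard-coded 33g + 8g request mirrors your settings; both are assumptions to adapt, and properties left at their defaults won't appear in the file):

```python
import xml.etree.ElementTree as ET

# Typical location on EMR nodes; adjust the path for other distributions.
YARN_SITE = "/etc/hadoop/conf/yarn-site.xml"

def yarn_property(name, path=YARN_SITE):
    """Return the value of a property from yarn-site.xml, or None if not overridden there."""
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

max_alloc_mb = int(yarn_property("yarn.scheduler.maximum-allocation-mb") or 0)
nm_memory_mb = int(yarn_property("yarn.nodemanager.resource.memory-mb") or 0)

# The per-container request (executor memory + overhead + offheap + pyspark memory)
# has to fit under both limits; 33g + 8g are the values from the question.
requested_mb = 33 * 1024 + 8 * 1024
print("request fits:", requested_mb <= min(max_alloc_mb, nm_memory_mb))
```

Whichever of the two properties is smaller effectively caps a single container, so executor memory plus overhead has to stay below it.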

mazaneicha
  • Hi, I am able to see the value (24g) on the YARN web page; however, on another cluster it is set to 120g. I am trying to understand how this value is decided. Is it the sum of the RAM of a certain type of node, or something else? – Shubham Meshram Jul 12 '22 at 09:47
  • Does this help? https://stackoverflow.com/questions/43826703/difference-between-yarn-scheduler-maximum-allocation-mb-and-yarn-nodemanager – mazaneicha Jul 12 '22 at 12:49
  • Hi, I used the UNIX command `locate` to find the XML file, and I can see the value set to 24g inside it; the same value shows on the YARN UI. I am led to believe this value is set based on the scaling activity in EMR and how many resources are available at any given point in time. Is that assumption correct? – Shubham Meshram Jul 13 '22 at 07:40
  • Most likely, it depends on the type of AWS instances in the cluster. – mazaneicha Jul 13 '22 at 12:00
  • Currently, I am writing some Python code to fetch the resource information (total and usable cores and RAM) at run time, generate the Spark settings from it, and only then start the actual job, roughly along the lines of the sketch after this comment thread. This is to ensure the best combination of Spark settings is applied to make the most of distributed computing (on the assumption that the best results are seen only when all worker nodes are of the same EC2 instance type). – Shubham Meshram Jul 16 '22 at 08:43
  • Well, good luck! Although I'm a bit skeptical about this approach, partially because of https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/ – mazaneicha Jul 16 '22 at 13:47
  • Hi @mazaneicha, can you please look at another question of mine - https://stackoverflow.com/questions/73105484/spark-job-crashes-with-error-in-prelaunch-err – Shubham Meshram Jul 25 '22 at 07:34
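
A rough sketch of the kind of calculation described in the comments above (the per-node figures, the 5-cores-per-executor rule of thumb, and the 10% overhead fraction are assumptions borrowed from common tuning advice, not values from this cluster):

```python
def suggest_spark_settings(node_ram_mb, node_cores, num_nodes,
                           cores_per_executor=5, overhead_fraction=0.10):
    """Derive executor settings from per-node resources (all worker nodes assumed identical)."""
    # Leave roughly 1 core and 1 GB per node for the OS and Hadoop daemons.
    usable_cores = node_cores - 1
    usable_ram_mb = node_ram_mb - 1024

    executors_per_node = max(1, usable_cores // cores_per_executor)
    # Total memory per executor (heap + overhead) must fit in the node's share.
    total_per_executor_mb = usable_ram_mb // executors_per_node
    # Spark adds max(384 MB, overhead_fraction * heap) as memoryOverhead by default.
    heap_mb = int(total_per_executor_mb / (1 + overhead_fraction))
    overhead_mb = max(384, total_per_executor_mb - heap_mb)

    return {
        "spark.executor.instances": executors_per_node * num_nodes,
        "spark.executor.cores": min(cores_per_executor, usable_cores),
        "spark.executor.memory": f"{heap_mb}m",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
    }

# Example: hypothetical 3-node cluster of 32 GB / 8-core workers.
print(suggest_spark_settings(node_ram_mb=32 * 1024, node_cores=8, num_nodes=3))
```

The derived spark.executor.memory plus spark.executor.memoryOverhead still has to stay under the cluster's yarn.scheduler.maximum-allocation-mb, which is exactly the check the original error was failing.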