I am currently working with EMR 6.4.0 and facing an issue while running a PySpark application. The code was working fine, but it suddenly started failing, and I am now stuck with two errors that I have no clue how to resolve.
The objective of the code is to read data from Snowflake, save temporary data on S3, and write the results back to a different Snowflake table at the end.
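For context, the read/write flow looks roughly like the sketch below. This is a minimal outline, not my actual code: the connection options, table names, and S3 path are placeholders, and the real job takes them from its command-line arguments.

```python
# Minimal sketch of the job's data flow; all option values below are placeholders.
from pyspark.sql import SparkSession

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

spark = SparkSession.builder.appName("weibull").getOrCreate()

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read the source table from Snowflake via the spark-snowflake connector.
df = (
    spark.read.format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "SOURCE_TABLE")
    .load()
)

# Stage intermediate results on S3.
df.write.mode("overwrite").parquet("s3://<bucket>/tmp/weibull/")

# Write the final output to a different Snowflake table.
(
    df.write.format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "TARGET_TABLE")
    .mode("overwrite")
    .save()
)
```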
1) NoClassDefFoundError: I am getting the error below on my EMR Spark step. I have looked into many posts, but I am still not clear on how to fix this:
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
**Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException**
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
**Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException**
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
Command exiting with ret '1'
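In case it is relevant, below is the kind of quick check I can run on the master node to see whether the missing shaded class exists in any Hadoop jar on the cluster. This is a rough sketch; /usr/lib/hadoop is where I believe EMR installs the Hadoop jars, so the path is an assumption.

```python
# Rough diagnostic: scan the cluster's Hadoop jars for the missing shaded class.
# Assumes EMR keeps its Hadoop jars under /usr/lib/hadoop*; adjust if not.
import glob
import zipfile

TARGET = "org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException.class"

for jar in glob.glob("/usr/lib/hadoop*/**/*.jar", recursive=True):
    try:
        with zipfile.ZipFile(jar) as zf:
            if TARGET in zf.namelist():
                print("found in", jar)
    except (zipfile.BadZipFile, OSError):
        continue  # skip unreadable or non-zip files
```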
- I submit my PySpark code using the command below as an EMR step, on an m4.large instance for testing (in PROD, I have a bigger instance type, m5.8xlarge).
```
spark-submit --deploy-mode cluster --master yarn \
  --driver-memory 4g --executor-memory 1g --executor-cores 1 --num-executors 1 \
  --conf spark.rpc.message.maxSize=100 \
  --jars /home/hadoop/configure_cluster/snowflake-jdbc-3.13.8.jar,/home/hadoop/configure_cluster/spark-snowflake_2.12-2.9.1-spark_3.1.jar \
  --py-files /home/hadoop/spark_utils.zip \
  /home/hadoop/weibull_2.py dev dafehv-dse-weibull-processing-dev
```
As shown in the command above, I am trying to cap memory usage by specifying limits in the spark-submit command, but I can see the following error in the logs:
diagnostics: Uncaught exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[memory-mb], Requested resource=<memory:35789, max memory:2147483647, vCores:2, max vCores:2147483647>, maximum allowed allocation=<memory:6144, vCores:4>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:6144, vCores:128>
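To check whether something on the cluster (e.g. spark-defaults.conf) is overriding my spark-submit flags, I can print the configuration the driver actually receives. A minimal sketch:

```python
# Print the effective memory/cores settings to see where the ~35 GB request originates.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "memory" in key.lower() or "cores" in key.lower():
        print(key, "=", value)
```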
Why does Spark try to allocate containers with resources that were never specified in my spark-submit command? I am lost here; I have been trying to figure out how to fix these two issues for the past week, but to no avail. I haven't worked much with Spark, so can anyone please guide me on how to proceed?