
I get an `org.apache.spark.SparkException` with `java.lang.NoClassDefFoundError: Could not initialize class XXX` (the class where the field validation lives) when I try to run field validations on a Spark DataFrame.

All the classes and objects used are serializable. The job fails on AWS EMR but works fine on my local machine. Here is my code:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

    val newSchema = df.schema.add("errorList", ArrayType(new StructType()
      .add("fieldName", StringType)
      .add("value", StringType)
      .add("message", StringType)))

    // validators is a sequence of validations on columns in a Row.
    // Validator method signature:
    // def checkForErrors(row: Row): (fieldName, value, message) = {
    //   logic to validate the field in a row
    // }

    val validateRow: Row => Row = (row: Row) => {
      val errorList = validators.map(validator => validator.checkForErrors(row))
      Row.merge(row, Row(errorList))
    }

    val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
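
For context, each validator follows this shape (a simplified, illustrative sketch; the real class, which contains the actual validation logic, is the one that fails to initialize):

    import org.apache.spark.sql.Row

    // Illustrative only: the real validators live in the class that
    // fails to initialize on EMR.
    trait Validator extends Serializable {
      // Returns (fieldName, value, message) describing any problem found.
      def checkForErrors(row: Row): (String, String, String)
    }

    object NotNullValidator extends Validator {
      def checkForErrors(row: Row): (String, String, String) =
        if (row.isNullAt(0)) ("field0", "null", "field0 must not be null")
        else ("field0", row.get(0).toString, "")
    }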

Versions: Spark 2.4.7 and Scala 2.11.8

Any ideas on why this might happen, or has anyone run into the same issue?

  • Regarding `NoClassDefFoundError`, see https://stackoverflow.com/questions/34413/why-am-i-getting-a-noclassdeffounderror-in-java and https://stackoverflow.com/questions/1457863/what-causes-and-what-are-the-differences-between-noclassdeffounderror-and-classn – Dmytro Mitin Feb 07 '23 at 02:10
  • Spark 2.4.7 is published for Scala 2.11.x and 2.12.x (https://mvnrepository.com/artifact/org.apache.spark/spark-core). Maybe some issue with dependencies. – Dmytro Mitin Feb 07 '23 at 10:54
  • Thanks for the reading on the errors, but that did not help. – Dheeraj Garikapati Feb 08 '23 at 14:58
  • See also https://stackoverflow.com/questions/7325579/java-lang-noclassdeffounderror-could-not-initialize-class-xxx and https://github.com/twitter/finagle/issues/634#issuecomment-320992894 – Dmytro Mitin Feb 09 '23 at 07:47

1 Answer


I faced a very similar problem with EMR release 6.8.0. In particular, the `spark.jars` configuration was not respected for me on EMR (I pointed it at the location of a JAR in S3), even though it is a normally accepted Spark parameter.
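
For reference, this is roughly how I was pointing at the JAR (a minimal sketch; the S3 path and app name are placeholders):

    import org.apache.spark.sql.SparkSession

    // Sketch only: the S3 path is a placeholder. On EMR 6.8.0 this
    // spark.jars setting was silently ignored for me.
    val spark = SparkSession.builder()
      .appName("validation-job")
      .config("spark.jars", "s3://my-bucket/jars/my-validators.jar")
      .getOrCreate()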

For me, the solution was to follow the AWS guide "How do I resolve the java.lang.ClassNotFoundException in Spark on Amazon EMR?": https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/

In CDK (where our EMR cluster definition lives), I set up an EMR step, executed immediately after cluster creation, that rewrites `spark.driver.extraClassPath` and `spark.executor.extraClassPath` to also contain the location of my additional JAR, as per the script in the article under "For Amazon EMR release version 6.0.0 and later". (In my case the JAR physically arrives in a Docker image, but you could also set up a bootstrap action to copy it onto the cluster from S3.)

The reason you have to do this "rewriting" is that EMR already populates `spark.*.extraClassPath` with a bunch of its own JAR locations, e.g. for the JARs that contain the S3 drivers, so you effectively have to append your own JAR location rather than simply setting `spark.*.extraClassPath` to your location. If you do the latter (I tried it), you will lose a lot of EMR functionality, such as being able to read from S3.

#!/bin/bash
#
# Example of script_b.sh (from the AWS article) for changing
# /etc/spark/conf/spark-defaults.conf.
#
# Wait until EMR has finished writing spark-defaults.conf.
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Append /home/hadoop/extrajars/* to every spark.*.extraClassPath
# entry, keeping EMR's own classpath entries intact.
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0
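
After the step runs, the two entries in /etc/spark/conf/spark-defaults.conf end up looking roughly like this (EMR's own entries vary by release and are abbreviated here with `...`):

    spark.driver.extraClassPath      /usr/lib/hadoop-lzo/lib/*:...:/home/hadoop/extrajars/*
    spark.executor.extraClassPath    /usr/lib/hadoop-lzo/lib/*:...:/home/hadoop/extrajars/*

so any JAR dropped into /home/hadoop/extrajars/ on the nodes is picked up without losing EMR's own classpath entries.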