AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

Question

I'm running Jupyterhub on EKS and wants to leverage EKS IRSA functionalities to run Spark workloads on K8s. I had prior experience of using Kube2IAM, however now I'm planning to move to IRSA.

This error is not because of IRSA, as service accounts are getting attached perfectly fine to Driver and Executor pods and I can access S3 via CLI and SDK from both. This issue is related to accessing S3 using Spark on Spark 3.0/ Hadoop 3.2

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

I'm using following versions -

APACHE_SPARK_VERSION=3.0.1
HADOOP_VERSION=3.2
aws-java-sdk-1.11.890
hadoop-aws-3.2.0
Python 3.7.3

I tested with different version as well.

aws-java-sdk-1.11.563.jar

Please help to give a solution if someone has come across this issue.

PS: This is not an IAM Policy error as well, because IAM policies are perfectly fine.

score 8 · Answer 1 · answered Oct 31 '20 at 18:55

Finally all the issues are solved with below jars -

hadoop-aws-3.2.0.jar
aws-java-sdk-bundle-1.11.874.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.874)

Anyone who's trying to run Spark on EKS using IRSA this is the correct spark config -

from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("pyspark-data-analysis-1") \
        .config("spark.kubernetes.driver.master","k8s://https://xxxxxx.gr7.ap-southeast-1.eks.amazonaws.com:443") \
        .config("spark.kubernetes.namespace", "jupyter") \
        .config("spark.kubernetes.container.image", "xxxxxx.dkr.ecr.ap-southeast-1.amazonaws.com/spark-ubuntu-3.0.1") \
        .config("spark.kubernetes.container.image.pullPolicy" ,"Always") \
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
        .config("spark.kubernetes.authenticate.executor.serviceAccountName", "spark") \
        .config("spark.kubernetes.executor.annotation.eks.amazonaws.com/role-arn","arn:aws:iam::xxxxxx:role/spark-irsa") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
        .config("spark.kubernetes.authenticate.submission.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
        .config("spark.kubernetes.authenticate.submission.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token") \
        .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.fast.upload","true") \
        .config("spark.executor.instances", "1") \
        .config("spark.executor.cores", "3") \
        .config("spark.executor.memory", "10g") \
        .getOrCreate()

Of all the solutions and StackOverflow posts, this is the only working solution. Please accept this as the answer! — Sathiya Sarathi, Feb 18 '22 at 06:28
Thanks for it @SathyasarathiGunasekaran. If you want to use Spark 3.2.0 below are the versions I have tested successfully - ENV APACHE_SPARK_VERSION 3.2.0 ENV HADOOP_VERSION 3.2 ENV HADOOP_AWS_JAR_VERSION 3.3.1 ENV AWS_JAVA_SDK_VERSION 1.12.65 — Prateek Dubey, Feb 19 '22 at 09:33

score 2 · Answer 2 · answered Nov 01 '20 at 17:49

Can check out this blog (https://medium.com/swlh/how-to-perform-a-spark-submit-to-amazon-eks-cluster-with-irsa-50af9b26cae) with:

Spark 2.4.4
Hadoop 2.7.3
AWS SDK 1.11.834

The example spark-submit is

/opt/spark/bin/spark-submit \
    --master=k8s://https://4A5<i_am_tu>545E6.sk1.ap-southeast-1.eks.amazonaws.com \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
    --conf spark.kubernetes.container.image=vitamingaugau/spark:spark-2.4.4-irsa \
    --conf spark.kubernetes.namespace=spark-pi \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
    --conf spark.kubernetes.authenticate.executor.serviceAccountName=spark-pi \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
    --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
    local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.4.4.jar 20000

AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

2 Answers2

Linked