
I'm stuck trying to run a Spark application written in Scala on a Spark cluster using the spark-submit command. The cluster runs on EMR. My question is really about the way Spark ships jars to the driver/executors and how it modifies the classpath.

My application is not a fat/uber JAR; instead, I keep all the dependency jars in a folder and ship them to the cluster at submit time with the --jars option.

Spark version is 2.4.4

This is how I run my application:

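# Build comma-separated jar lists: ls -m separates entries with ", ", then tr strips the spaces and newlines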
JARS=$(ls -m target/pack/lib/* | tr -d ' ' | tr -d '\n')
# Note : I tried with/without including $HADOOP_JARS
HADOOP_JARS=$(ls -md /usr/lib/hadoop/lib/*.jar | tr -d ' ' | tr -d '\n')

spark-submit --master yarn \
  --class my.main.class \
  --driver-memory 2G \
  --executor-memory 3G \
  --executor-cores 2 \
  --num-executors 2 \
  --conf "spark.executor.extraJavaOptions=-Dorg.bytedeco.javacpp.maxbytes=4G" \
  --conf "spark.driver.extraJavaOptions=-Dorg.bytedeco.javacpp.maxbytes=3G" \
  --conf spark.executor.memoryOverhead=4G \
  --conf spark.driver.memoryOverhead=2048 \
  --jars $JARS,$HADOOP_JARS \
  target/scala-2.11/tep_dl4j_2.11-1.0.jar arg1 arg2

Then I try to create a DataFrame by reading a CSV file from S3.
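For reference, the read itself is roughly the following. This is only a minimal sketch of what my code does: the bucket and path are placeholders, and the fs.s3.impl setting mirrors the NativeS3FileSystem that shows up in the stack trace.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tep_dl4j")
  .getOrCreate()

// Use the native S3 filesystem for s3:// URIs (this is the class from the stack trace below)
spark.sparkContext.hadoopConfiguration
  .set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Placeholder bucket/path, not my actual values
val df = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/some/prefix/data.csv")

df.show()

This is where the problems begin: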

Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:343)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:333)

(I'll spare you the full stack trace)

Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 28 more

This class (org.jets3t.service.ServiceException) comes from the jets3t-0.9.0.jar file, which does get shipped when the application is submitted, according to this line in the logs:

INFO Client: Uploading resource file:/usr/lib/hadoop/lib/jets3t-0.9.0.jar -> hdfs://ip-172-31-10-119.ec2.internal:8020/user/hadoop/.sparkStaging/application_1582370799483_0001/jets3t-0.9.0.jar

Even in the Spark UI, I can see the jar listed under Classpath Entries with the mention "Added By User". But I still get this exception.

I've looked around on SO and ended up at this answer. According to it, the only way to resolve the problem is to use the --driver-class-path option (I sketch what that workaround would look like at the bottom of this post). It also says that the --jars option "does not add the JAR to your driver/executor classpath". The author is apparently right, but this is not what the documentation says. Indeed, according to the docs:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with --jars.

Am I missing something here? What is the point of shipping the jars to HDFS if they are not added to the classpath?
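For completeness, if I applied the workaround from the linked answer, I suppose my submit command would gain something like the lines below. The jets3t path is the one from the upload log above, and the spark.executor.extraClassPath setting is my guess at the executor-side equivalent; I haven't verified that this is the right approach on EMR.

spark-submit --master yarn \
  --class my.main.class \
  --driver-class-path /usr/lib/hadoop/lib/jets3t-0.9.0.jar \
  --conf spark.executor.extraClassPath=/usr/lib/hadoop/lib/jets3t-0.9.0.jar \
  --jars $JARS \
  target/scala-2.11/tep_dl4j_2.11-1.0.jar arg1 arg2

If that is really needed, it means listing jars twice (once for --jars, once for the classpath options), which is exactly what I would like to understand.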

Thanks for your answers

ChrisYohann
  • EMR has jets3t installed (I just checked on a cluster); the way you deploy the application (including the hadoop jars) is problematic. Can you use the assembly plugin (sbt) or the shade plugin (maven)? – SQL.injection Feb 22 '20 at 17:59
  • Thanks for replying. Including the hadoop jars was only in my last attempt, to be sure that jets3t was actually shipped. Removing $HADOOP_JARS does not solve the issue. Another point: when I don't use S3 to save/load data, everything works just fine, and my application loads classes from the external jars. I know that I could make an uber jar using the assembly plugin, but I would like to understand why this fails in this particular case. – ChrisYohann Feb 22 '20 at 18:41
  • Does it make a difference if you use client or cluster mode? – Jasper-M Feb 23 '20 at 15:09
  • I have personally used `--jars` with `spark-shell` and they were definitely added to the classpath. – Jasper-M Feb 23 '20 at 15:14
  • Can you run an `echo $JARS` and an `echo $HADOOP_JARS`? Can you also share your mvn/sbt dependencies? – SQL.injection Feb 23 '20 at 15:49
  • @Jasper-M Curiously, I've just tried and indeed it works in cluster mode. I had been using client mode since the beginning. In cluster mode, the Spark driver has the same classpath as the executors, and it seems that more jars are added. I have written a small function that prints the classpath on the driver and on the executors. [Here](https://gist.github.com/ChrisYohann/d48aa1e68bd7cc6545791eaeeddbd8c4) is the result if it can help. – ChrisYohann Feb 23 '20 at 19:53
  • @SQL.injection [build.sbt](https://gist.github.com/ChrisYohann/f2c1689dc5d4999b77c577ed65728017) I use the sbt pack plugin, which outputs all the dependencies in the `target/pack/lib` folder. I've also tried `sbt assembly`, but I get the same problem. [plugins.sbt](https://gist.github.com/ChrisYohann/a357d37586bd38526cb848cd02579b1b) The output of the `echo $JARS` command is [here](https://gist.github.com/ChrisYohann/4a891120057a06727ee6d38106b45e6d). I also tried a minimal snippet: [S3ReaderMain.scala](https://gist.github.com/ChrisYohann/8b201712c246eaa1b1f1d9d599195503) – ChrisYohann Feb 23 '20 at 20:04
  • Ok strange. If anything I would have expected the reverse. – Jasper-M Feb 23 '20 at 20:16
  • I was getting the same error. I don't know the reason but removing `hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")` from my code made it work. – Rakesh K Mar 20 '20 at 13:50
