I'm stuck trying to run a Spark application written in Scala on a Spark cluster (on EMR) using the spark-submit command. The question is really about the way Spark ships jars to the driver/executors and how it modifies the classpath.
My application is not packaged as a fat/uber JAR; instead, I keep all the dependency jars in a folder and ship them to the cluster at submit time with the --jars option.
Spark version is 2.4.4
This is how I run my application:
```bash
# Build comma-separated jar lists: `ls -m` prints "a, b, c" and the two
# `tr` calls strip the spaces and newlines, which is the format --jars expects.
JARS=$(ls -m target/pack/lib/* | tr -d ' ' | tr -d '\n')
# Note: I tried with and without including $HADOOP_JARS
HADOOP_JARS=$(ls -md /usr/lib/hadoop/lib/*.jar | tr -d ' ' | tr -d '\n')

spark-submit --master yarn \
  --class my.main.class \
  --driver-memory 2G \
  --executor-memory 3G \
  --executor-cores 2 \
  --num-executors 2 \
  --conf "spark.executor.extraJavaOptions=-Dorg.bytedeco.javacpp.maxbytes=4G" \
  --conf "spark.driver.extraJavaOptions=-Dorg.bytedeco.javacpp.maxbytes=3G" \
  --conf spark.executor.memoryOverhead=4G \
  --conf spark.driver.memoryOverhead=2048 \
  --jars "$JARS,$HADOOP_JARS" \
  target/scala-2.11/tep_dl4j_2.11-1.0.jar arg1 arg2
```
Then I try to create a DataFrame by reading a CSV from S3, and this is where the problems begin.
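The read itself is unremarkable; a minimal sketch (the bucket and key below are placeholders, everything else matches what I do; the s3n:// scheme is the one served by NativeS3FileSystem in the trace):

```scala
// Minimal sketch of the failing read; bucket/key are placeholders and
// `spark` is the usual SparkSession.
val df = spark.read
  .option("header", "true") // illustrative option, not the culprit
  .csv("s3n://my-bucket/some/prefix/data.csv")
df.show()
```

It blows up as soon as the underlying file system is initialized: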
```
Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:343)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:333)
    ... (I'll spare you the rest of the stack trace)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 28 more
```
This class (org.jets3t.service.ServiceException) comes from the jets3t-0.9.0.jar file, which does get shipped to the cluster when the application is submitted, according to this log line:

```
INFO Client: Uploading resource file:/usr/lib/hadoop/lib/jets3t-0.9.0.jar -> hdfs://ip-172-31-10-119.ec2.internal:8020/user/hadoop/.sparkStaging/application_1582370799483_0001/jets3t-0.9.0.jar
```
In the Spark UI I can even see the jar under Classpath Entries, marked "Added By User". And yet I still get this exception.
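For what it's worth, here is a quick diagnostic I could drop into the main method (just a sketch, not something already in my app) to ask each classloader whether it can see the class; as far as I understand, the thread's context classloader is where Spark exposes the --jars on the driver:

```scala
// Sketch of a classloader check (not in my app yet). Class.forName with an
// explicit loader tells us which loader, if any, can resolve the class.
def canSee(cl: ClassLoader, name: String): Boolean =
  try { Class.forName(name, false, cl); true }
  catch { case _: ClassNotFoundException => false }

val name = "org.jets3t.service.ServiceException"
println(s"system classloader:  ${canSee(ClassLoader.getSystemClassLoader, name)}")
println(s"context classloader: ${canSee(Thread.currentThread.getContextClassLoader, name)}")
```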
I've looked around on SO and ended up at this answer. There, the only way to resolve the problem was to use the --driver-class-path option (I show a version adapted to my command below). Moreover, that answer says that the --jars option "does not add the JAR to your driver/executor classpath". Its author is apparently right, but this is not what the documentation says. Indeed, according to the docs:
> When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with --jars.
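For concreteness, the workaround from that answer, adapted to my command, would look like this (note that --driver-class-path takes a ':'-separated list of local paths, unlike the ','-separated --jars):

```bash
# Same submit as above, abbreviated; the only change is the explicit
# --driver-class-path entry for the jar the driver fails to find.
spark-submit --master yarn \
  --class my.main.class \
  --driver-class-path /usr/lib/hadoop/lib/jets3t-0.9.0.jar \
  --jars "$JARS,$HADOOP_JARS" \
  target/scala-2.11/tep_dl4j_2.11-1.0.jar arg1 arg2
```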
Am I missing something here? What would be the point of just shipping jars over HDFS without including them in the classpath?
Thanks for your answers.