4

I am running Spark using cluster mode for deployment. Below is the command:

JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\
$JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\
$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\
$JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\
$JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar

dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
 --executor-memory 512M \
 --total-executor-cores 3 \
 --deploy-mode "cluster" \
 --master spark://$MASTER:7077 \
 --jars=$JARS \
 --supervise \
 --class "com.testclass" $APP_JAR  input.json \
 --files "/home/test/input.json"

The above command works fine in client mode. But when I use it in cluster mode I get a class-not-found exception:

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

In client mode the dependent jars are getting copied to the /var/lib/spark/work directory, whereas in cluster mode they are not. Please help me get this solved.

EDIT:

I am using NFS and I have mounted the same directory on all the Spark nodes under the same name, but I still get the error. How is it able to pick up the application jar, which is in the same directory, but not the dependent jars?

Knight71

2 Answers

7

In client mode the dependent jars are getting copied to the /var/lib/spark/work directory whereas in cluster mode they are not.

In cluster mode the driver program runs inside the cluster, not on the local machine (as it does in client mode), so the dependent jars must be accessible from the cluster nodes; otherwise the driver program and executors will throw a "java.lang.NoClassDefFoundError".

When using spark-submit, the application jar, along with any jars included with the --jars option, is automatically transferred to the cluster.

So your extra jars can be added via --jars; they will be copied to the cluster automatically.

Please refer to the "Advanced Dependency Management" section at the link below:
http://spark.apache.org/docs/latest/submitting-applications.html
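
For example, a minimal sketch of what this answer recommends (the jar names and $JARS_HOME paths are placeholders taken from the question, not a verified setup): pass the dependency jars as a comma-separated list to --jars and let spark-submit ship them to the cluster.

# Hypothetical sketch: let spark-submit distribute the dependencies via --jars.
# --jars takes a comma-separated list; the paths below are placeholders.
dse spark-submit \
  --master spark://$MASTER:7077 \
  --deploy-mode cluster \
  --jars $JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar \
  --class "com.testclass" $APP_JAR input.json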

Shawn Guo
  • I have kept all the jars in an NFS file system and mounted the same directory on all the Spark nodes. – Knight71 Dec 15 '15 at 10:47
  • I guess they were not included in the classpath. You can enable the history server and check the application environment and its classpath; it is very useful for investigating "NoClassDefFoundError" exceptions. – Shawn Guo Dec 15 '15 at 11:20
  • Yes, you are right. I used dse -v and found the classpath elements to be empty in cluster deploy mode, but in client mode they are populated. I am not sure what I am missing here. – Knight71 Dec 15 '15 at 11:28
  • Good news :). I think you'd better use the --jars parameter; it is the recommended way to distribute additional jars in cluster mode. Besides that, I sometimes use "spark.driver.extraClassPath" to explicitly add jars to the driver classpath (because I have found that --jars is sometimes still not added to the classpath). Please have a try. – Shawn Guo Dec 15 '15 at 12:44
  • Thanks for the help. After struggling for two days, I used --driver-class-path with the dependent jars. I also found out that the jars should be colon-separated instead of comma-separated (see the sketch after these comments). – Knight71 Dec 17 '15 at 09:45
  • Cool. So it seems that --driver-class-path is necessary? It does not look to be mentioned in the documentation for use with --jars; I have encountered this issue too. – Shawn Guo Dec 17 '15 at 09:56
  • Yes, it is necessary, and it is not mentioned in the documentation. I had a look at the spark-submit code: the jars are added to the classpath in client mode but not in cluster mode. – Knight71 Dec 17 '15 at 10:01
  • When I try to use the --jars option, Spark is not adding the jar to the cluster automatically. Only the jar into which my module is compiled is moved to the cluster, but the jars I mention with --jars are not. My original issue is posted at https://stackoverflow.com/questions/62434580/spark-mongo-db-connector-no-class-def-found – Sam Berchmans Jul 01 '20 at 18:22
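
To summarize the comment thread in a sketch (the paths are placeholders, not the asker's exact setup): --jars takes a comma-separated list, while --driver-class-path takes a colon-separated list, and in this case explicitly setting the driver classpath is what made the classes visible to the driver in cluster mode.

# Sketch only: --jars is comma-separated, --driver-class-path is colon-separated.
# Setting the driver classpath explicitly worked around the jars not being
# added to the driver classpath in cluster mode; all paths are placeholders.
dse spark-submit \
  --master spark://$MASTER:7077 \
  --deploy-mode cluster \
  --jars $JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,$JARS_HOME/kafka_2.10-0.8.2.1.jar \
  --driver-class-path $JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar:$JARS_HOME/kafka_2.10-0.8.2.1.jar \
  --class "com.testclass" $APP_JAR input.json

Alternatively, the equivalent property spark.driver.extraClassPath can be set with --conf, as mentioned in the comments above.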
1

As the Spark documentation says:

  1. Keep all jars and dependencies in the same local path on all nodes in the cluster, or
  2. Keep the jars in a distributed file system that all nodes have access to (see the sketch below).

Spark properties
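
As an illustration of option 2, a hedged sketch that references the dependency jars from a shared location such as HDFS; the hdfs:///libs/... paths are placeholders, not from the original post.

# Sketch for option 2: keep the jars where every node can read them, e.g. HDFS.
# The hdfs:// paths are placeholders.
spark-submit \
  --master spark://$MASTER:7077 \
  --deploy-mode cluster \
  --jars hdfs:///libs/spark-streaming-kafka_2.10-1.4.1.jar,hdfs:///libs/kafka_2.10-0.8.2.1.jar \
  --class "com.testclass" $APP_JAR input.json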

WoodChopper
  • Hi @WoodChopper, can you let us know how to put the jar files in HDFS so that the executors will be able to pick them up? What is the command line to do it when running spark-submit? – Sam Berchmans Jul 01 '20 at 18:10
  • @Sam Berchmans Keep the jars in an HDFS location, then pass --conf spark.yarn.jars=hdfs://mylocation. – WoodChopper Jul 13 '20 at 20:26