
In Submitting Applications in the Spark docs, as of 1.6.0 and earlier, it's not clear how to specify the --jars argument: it is apparently not a colon-separated classpath, and it does not do directory expansion.

The docs say "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."

Question: What are all the options for submitting a classpath with --jars in the spark-submit script in $SPARK_HOME/bin? Anything undocumented that could be submitted as an improvement for the docs?

I ask because when I was testing --jars today, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar

We are choosing to pre-populate the cluster with all the jars in /usr/local/spark/jars on each worker. It seemed that if no local:/, file:/, or hdfs:/ scheme was supplied, then the default is file:/, and the driver makes the jars available on a web server run by the driver. I chose local:, as above.
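
To illustrate the difference, here is a minimal sketch using the same class and jar paths as above; only one dependency jar is shown per command, and the hdfs:// path is hypothetical, just to show the form:

# file:/ (the apparent default when no scheme is given): the driver serves the jar
# to the executors from its own HTTP file server
/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars file:/usr/local/spark/jars/guava-14.0.1.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar

# local:/ : every worker is expected to already have the jar at exactly this path,
# so nothing is shipped over the network
/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/guava-14.0.1.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar

# hdfs:// : the executors pull the jar down from HDFS (hypothetical path)
/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars hdfs:///user/spark/jars/guava-14.0.1.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar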

And it seems that we do not need to put the main jar in the --jars argument. I have not yet tested whether other classes in the final argument (the application-jar argument per the docs, i.e. /usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I need to put the application-jar in the --jars path to get classes not named by --class to be seen.
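
If it turns out those classes are not shipped automatically, the (untested) variant I have in mind would simply list the application jar under --jars as well as in the final argument:

# untested sketch: the application jar appears both in --jars and as the application-jar argument
/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/thold-0.0.1-1.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar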

(And granted, with Spark standalone mode using --deploy-mode client, you also have to put a copy of the driver program on each worker, since you don't know up front which worker will run the driver.)

JimLohse
  • Wanted to provide a clear answer to this, per my comment and a response on Josh Rosen's answer here: http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option/24968221?noredirect=1#comment57212105_24968221 – JimLohse Jan 12 '16 at 08:05

2 Answers


This way it worked easily, instead of specifying each jar (with its version) separately:

#!/bin/sh
# build a comma-separated list of all the other dependent jars in OTHER_JARS,
# leaving out the application jar, which is passed separately

JARS=$(find ../lib -name '*.jar')
OTHER_JARS=""
for eachjarinlib in $JARS ; do
    if [ "$eachjarinlib" != "APPLICATIONJARTOBEADDEDSEPERATELY.JAR" ]; then
        OTHER_JARS=$eachjarinlib,$OTHER_JARS
    fi
done
echo "--- final list of jars: $OTHER_JARS"
echo "$CLASSPATH"

# OTHER_JARS already ends with a trailing comma, so the application jar is appended directly
spark-submit --verbose --class <yourclass> \
    ... OTHER OPTIONS \
    --jars ${OTHER_JARS}APPLICATIONJARTOBEADDEDSEPERATELY.JAR
  • Using the tr Unix command can also help, as in the example below.

    --jars $(echo /dir_of_jars/*.jar | tr ' ' ',')
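
    For example (a sketch reusing the class and jar directory from the question; adjust the directory to your own layout), the one-liner drops straight into the submit command:

    spark-submit --verbose --class jpsgcs.thold.PipeLinkageData \
      --jars "$(echo /usr/local/spark/jars/*.jar | tr ' ' ',')" \
      /usr/local/spark/jars/thold-0.0.1-1.jar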

Ram Ghadiyaram
  • This is a nice workaround, [for Spark 1.6.1 and later, when available, the comma list requirement should be documented more directly](http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management). I went ahead and accepted this. Just to be sure, this works as is? The line `OTHER_JARS=$eachjarinlib,$OTHER_JARS` doesn't need any quotes? I guess the shell is not going to expand that comma out and therefore takes it just to be a string, but it might be safer to double quote it? `OTHER_JARS="$eachjarinlib,$OTHER_JARS"`? – JimLohse Feb 22 '16 at 14:22
  • 1
    Jim, Yes doesn't need double quotes. This works(spark 1.3,1.5 versions) as it is. – Ram Ghadiyaram Feb 23 '16 at 06:39
  • 4
    this is another option which is simpler --jars $(echo /dir/of/jars/*.jar | tr ' ' ',') – Ram Ghadiyaram Feb 29 '16 at 07:11

One way (the only way?) to use the --jars argument is to supply a comma-separated list of explicitly named jars. The only place I found the comma-separated format spelled out was a Stack Overflow answer that led me to look beyond the docs to the command line:

spark-submit --help 

The output from that command contains:

 --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths. 

Today when I was testing --jars, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar
JimLohse