209

True... it has been discussed quite a lot.

However, there is a lot of ambiguity, and some of the answers provided are incomplete ... including ones that suggest duplicating JAR references in the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, or omitted details should be clarified for each option:

  • How ClassPath is affected
    • Driver
    • Executor (for running tasks)
    • Both
    • Not at all
  • Separation character: comma, colon, semicolon
  • If provided files are automatically distributed
    • for the tasks (to each executor)
    • for the remote Driver (if run in cluster mode)
  • Type of URI accepted: local file, HDFS, HTTP, etc.
  • If copied into a common location, where that location is (HDFS, local?)

The options that this affects:

  1. --jars
  2. SparkContext.addJar(...) method
  3. SparkContext.addFile(...) method
  4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
  5. --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
  6. --conf spark.executor.extraClassPath=...
  7. --conf spark.executor.extraLibraryPath=...
  8. Not to forget, the last parameter of spark-submit is also a .jar file.

I am aware of where to find the main Apache Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left quite a few holes for me, although it was partially answered there too.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from documentation, it seems that --jars, and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that for simplicity, I can add additional application JAR files using the three main options at the same time?

spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

I found a nice article in an answer to another posting. However, nothing new was learned. The poster does make a good remark about the difference between a local driver (yarn-client) and a remote driver (yarn-cluster). It is definitely important to keep in mind.

YoYo
  • 1
    Which cluster manager are you running under? Standalone/YARN/Mesos? – Yuval Itzchakov May 10 '16 at 15:12
  • Any. I intend this as a clarification to the original documentation. I am mostly using a standalone cluster, single instance, yarn-client, and yarn-cluster. Others might be using Mesos. It seems you did some good original research on [your blog](http://asyncified.io) on this. I ended up doing mostly the same as you - using a shade plugin to create an uber JAR to simplify my deployment process. – YoYo May 10 '16 at 15:15
  • @Yuval Itzchakov, just like Yoyo mentioned, I too use a shaded jar to bundle up all my dependencies e.g. case classes, and other jars that I may be using. I am trying to understand when would I run into a situation where I need multiple jars. I mean I can always bundle those multiple jars into 1 uber jar. Why can't I continue to live with my shaded jars bundling up all my dependencies? – Sheel Pancholi Nov 15 '19 at 14:33
  • The multi-JAR uber bundle is just not practical in multi-user environments where some of these users are really not that expert and would just be interested in running the Python logic without really knowing which Snowflake JDBC JAR file should be added to the bundle. This is where the SRE comes in. – Mário de Sá Vera May 31 '21 at 11:31

7 Answers

234

ClassPath:

ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:

  • spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the node running the driver.
  • spark.executor.extraClassPath to set the extra classpath on the Worker nodes.

If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.
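
For example, a minimal sketch that puts the same (hypothetical) JAR on both classpaths. Note that these settings only modify the classpath and do not copy anything, so the file has to already exist at that path on every node:

spark-submit \
  --conf "spark.driver.extraClassPath=/opt/libs/mylib.jar" \
  --conf "spark.executor.extraClassPath=/opt/libs/mylib.jar" \
  --class MyClass main-application.jar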

Separation character:

Following the same rules as the JVM:

  • Linux: A colon, :
    • e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
  • Windows: A semicolon, ;
    • e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"

File distribution:

This depends on the mode which you're running your job under:

  1. Client mode - Spark fires up a Netty HTTP server which distributes the files at startup to each of the worker nodes. You can see that when you start your Spark job:

    16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
    16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
    16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
    
  2. Cluster mode - In cluster mode, Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JAR files available to all the worker nodes via HDFS, S3, or other sources which are available to all nodes (a minimal sketch follows below).
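
     For instance, a sketch assuming the JAR files have already been uploaded to a location every node can reach, such as HDFS (the paths here are hypothetical):

     spark-submit --deploy-mode cluster \
       --jars hdfs:///libs/additional1.jar,hdfs:///libs/additional2.jar \
       --class MyClass hdfs:///apps/main-application.jar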

Accepted URIs for files

In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
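
To make the schemes concrete, here is a hedged sketch that mixes them (all paths are made up): following the documentation text above, a plain path behaves like file: and is served by the driver's HTTP file server, hdfs: is pulled down by each node, and local: is expected to already exist on every node:

spark-submit \
  --jars /tmp/only-on-client.jar,hdfs:///libs/small-dep.jar,local:///opt/libs/big-shared.jar \
  --class MyClass main-application.jar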

As noted, JAR files are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work, and you'll see them like this:

drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033

And when you look inside, you'll see all the JAR files you deployed along:

[*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
[*@*]$ ll
total 89988
-rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
-rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
-rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
-rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
-rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
-rw-r--r-- 1 spark spark        0 May  8 17:34 stdout

Affected options:

The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file

So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
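
As a hedged illustration of that ordering (the paths and values below are just placeholders):

# spark-defaults.conf (lowest precedence):
#   spark.executor.extraClassPath  /opt/libs/from-defaults.jar

# A flag passed to spark-submit overrides spark-defaults.conf:
spark-submit --conf spark.executor.extraClassPath=/opt/libs/from-flag.jar \
  --class MyClass main-application.jar

# A value set directly on SparkConf in the application code, e.g.
#   new SparkConf().set("spark.executor.extraClassPath", "/opt/libs/from-code.jar")
# takes precedence over both.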

Let’s analyze each option in the question:

  • --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and the other via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR file to your driver/executor classpath. You'll need to explicitly add them using the extraClassPath configuration on both. (As the comments and the current documentation point out, in Spark 2.x and later the JAR files passed with --jars are also included in the driver and executor classpaths.)
  • SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code.
  • --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases, and it doesn't matter which one you choose.
  • --conf spark.driver.extraLibraryPath=..., or --driver-library-path ... Same as above, aliases.
  • --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an über JAR (for example, because there are compile time conflicts between library versions) and which you need to load at runtime.
  • --conf spark.executor.extraLibraryPath=... This is passed as the java.library.path option for the JVM. Use this when you need a library path (typically for native libraries) to be visible to the JVM.

Would it be safe to assume that for simplicity, I can add additional application JAR files using the three main options at the same time?

You can safely assume this only for Client mode, not Cluster mode, as I've explained previously. Also, the example you gave has some redundant arguments. For example, passing JAR files to --driver-library-path is useless. You need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, when you deploy external JAR files on both the driver and the workers, you want:

spark-submit --jars additional1.jar,additional2.jar \
  --driver-class-path additional1.jar:additional2.jar \
  --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar
Yuval Itzchakov
  • 4
    Great and comprehensive answer. Thank you. Could you also tell more about best practices in deployment with *uber JAR* vs. *dependencies outside JAR* (libs in external folder and listed in `MANIFEST.MF` file)? – jsosnowski Aug 25 '16 at 09:46
  • @jsosnowski If you think about automated deployments and avoiding conflicts ... Why would you consider *not* using an uber jar? – YoYo Aug 26 '16 at 02:38
  • 2
    @jsosnowski Usually, I only defer to using external jars when there are conflicts that are very complex to solve with my uber JAR. I usually get by simply by using SBT's `assemblyMergeStrategy` and selecting the classes I need if there are conflicts. I'd generally recommend the same. – Yuval Itzchakov Aug 26 '16 at 13:07
  • 13
    @yuval-itzchakov Thanks for the great answer, very helpful. One point I would like to emphasize to help others who may have made the same mistake as I did. The --jars argument only transports the jars to each machine in the cluster. It does NOT tell spark to use them in the class path search. The --driver-class-path (or similar arguments or config parameters) are also required. I initially thought they were alternate ways of doing the same thing. – Tim Ryan Sep 16 '16 at 16:46
  • 1
    @TimRyan Definitely. If you look at the last part of the answer, I pass jars both to the `--jars` flag *and* the driver/executor class path. – Yuval Itzchakov Sep 16 '16 at 18:23
  • Thanks for comprehensive answer. What is the example of `local:/` URI for `--jars` option? I use `local:///mnt/dir/file.jar` and it gets converted to `"file:/mnt/dir/file.jar": "Added By User"` in spark application log, in section "Classpath Entries". Eventually it does not work. Apart from that, your last command in the answer is wrong, it should say `--jars`, not `--jar`. – Mike Dec 21 '16 at 10:58
  • @Mike What do you mean by *"it doesn't work"*? What are you trying to do? – Yuval Itzchakov Dec 22 '16 at 07:08
  • I was trying to make Zeppelin in EMR see my jars. Initially I tried `spark.jars` which gets translated to `--jars` as described [here](https://zeppelin.apache.org/docs/latest/interpreter/spark.html#2-loading-spark-properties). With that, my jars were listed in Spark logs as I described above but no classes were visible to Spark. I tried many options in Interpreter tab of Zeppelin and none of these worked. In despair I thought that my URI format is wrong and wrote that comment. – Mike Dec 22 '16 at 13:09
  • 1
    Eventually I found [how to inject](http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Small-tips-when-running-Zeppelin-on-EMR-td3062.html) environment variables into `zeppelin-env.sh` and added `--jars` to `SPARK_SUBMIT_OPTIONS`. That worked. URI format I use is `--jars=local:///mnt/dir/file.jar`. – Mike Dec 22 '16 at 13:13
  • @YuvalItzchakov: Thanks for the detailed answer. I am currently submitting my job as an Uber/Fat JAR, I would like to leverage existing (default) dependencies of Spark (e.g. api.Java.*). How do we do it without an _uber_ JAR? Does Spark automatically pick up default JARs (spark-assembly-1.6.1-hadoop2.3.0 etc.)? – CᴴᴀZ Feb 28 '17 at 13:25
  • 1
    @CᴴᴀZ Spark has jars deployed in its `jars` directory when you install it, if that's what you're after. Not sure what you mean by *How do we do it without an uber JAR*, not sure what *it* refers to. – Yuval Itzchakov Feb 28 '17 at 13:31
  • @YuvalItzchakov **Spark Standalone** does have the JARs, but was unable to find such JARs in Enterprise distributions (Hortonworks Data Platform to be specific). Answering your question: _it_ refers to _deployment_, currently am using _uber_ JAR for _it_. – CᴴᴀZ Mar 02 '17 at 10:06
  • Hello Yuval, thank you for your answer, it really helps me. As you said, `addJar()` does not add jars to the classpath. In order to use the jar, we have to move the jars to each node and declare them in the classpath for both driver and executor explicitly. What is the use of `addJar()` then? How to use imported jars that are transferred by the method `addJar`? Thx – Frank Kong Mar 22 '17 at 16:15
  • 1
    @Frankie The purpose of `addJar` is that Spark will be able to distribute JARs which aren't packaged as part of your uber JAR to the Worker nodes. Otherwise, it won't. – Yuval Itzchakov Mar 22 '17 at 16:50
  • @YuvalItzchakov I get your point but what I am wondering is how to use these jars distributed by Spark? I still get the classNotFound problem after I use addJar() to distribute jars. – Frank Kong Mar 22 '17 at 17:00
  • @YuvalItzchakov Thanks for your very useful and detailed post. Any comment on my [question](http://stackoverflow.com/questions/42967472/adding-a-jar-file-to-pyspark-after-context-is-created/42968126#42968126). I dont have access to the command line and do not manage the creation of the SparkSession so I need to use the 'addJar' method but need also to have the class available from my driver node in Python. – tog Mar 24 '17 at 05:10
  • @Frankie Did you add them to your executor classpath? – Yuval Itzchakov Mar 28 '17 at 07:58
  • @YuvalItzchakov I did not, now I get it. – Frank Kong Mar 28 '17 at 14:56
  • Great answer! @YuvalItzchakov Would you mind looking what is wrong with this setup: `spark-submit --jars http://XXX.XXX.XX.XX/path/to/xgboost4j-spark-0.8-SNAPSHOT-jar-with-dependencies.jar --driver-class-path http://XXX.XXX.XX.XX/path/to/xgboost4j-spark-0.8-SNAPSHOT-jar-with-dependencies.jar --conf spark.executor.extraClassPath=http://XXX.XXX.XX.XX/path/to/xgboost4j-spark-0.8-SNAPSHOT-jar-with-dependencies.jar --class XGBoostOnCluster --master mesos://YYY.YY.Y.YY:7077 --deploy-mode cluster http://XXX.XXX.XX.XX/path/to/project-0.1.0-SNAPSHOT.jar` It gives `java.lang.NoClassDefFoundError` – astro_asz Feb 23 '18 at 13:45
  • Hi, I tried adding the following statements in my scala class but it seems spark is intermittently picking up the older version of jar 0.1.52 during runtime when I am expecting 0.1.55. Am I missing something or doing anything incorrectly here?`sparkContext.addJar("s3://temp/jsch-0.1.55.jar") sparkSession.conf.set("spark.driver.extraClassPath", "s3://temp/jsch-0.1.55.jar") sparkSession.conf.set("spark.executor.extraClassPath", "s3://temp/jsch-0.1.55.jar")` – Dwarrior Mar 16 '19 at 03:15
  • @TimRyan / Yuval - "One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath" Is that still true? In Spark 2.4.4 `spark-submit --help` shows the following text for `--jars`: "Comma-separated list of jars to include on the driver and executor classpaths." – Nick Chammas Jan 17 '20 at 17:22
  • @NickChammas As far as I recall that's for client mode, not cluster mode. – Yuval Itzchakov Jan 18 '20 at 07:25
  • The last example is confusing. For `--jars additional1.jar,additional2.jar`, where should I place the 2 jars before submitting? And `--conf spark.executor.extraClassPath=additional1.jar:additional2.jar`, what is the actual path of the 2 jars at runtime, do I need to care? – Dagang Jan 22 '21 at 18:50
  • It seems to me that this explanation should be for Spark 1 and is currently outdated for Spark 2 and 3. Multiple reference sites using extra JARs for accessing external data sources such as Hive or HWC only include the "--jars" option without needing the extraClassPath options. For examples check these: * https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/hive/ * https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_submit_a_hivewarehouseconnector_python.html – Luis Vazquez Mar 13 '21 at 13:19
  • I have checked the docs and I can confirm that with Spark 2.x and above, the jars included in the --jars option are also included in the classpath: ```When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths... ``` – Luis Vazquez Mar 13 '21 at 13:36
  • @LuisVazquez Then the behavior has most likely changed since I wrote this answer. Will update. – Yuval Itzchakov Mar 13 '21 at 19:17
6

Another approach, in Apache Spark 2.1.0, is to use --conf spark.driver.userClassPathFirst=true during spark-submit, which changes the priority of dependency loading, and thus the behavior of the Spark job, by giving priority to the JAR files the user is adding to the classpath with the --jars option.
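
For example, a minimal sketch (the JAR name is a placeholder; spark.executor.userClassPathFirst is the analogous, equally experimental setting for the executor side):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars mylib-2.0.jar \
  --class MyClass main-application.jar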

Stanislav
  • 3
    You will have to be careful with that, as it is possible to break Spark by doing so. This should be a last-resort solution. Potentially it could also interfere with the layer interfacing with YARN when used in yarn-client mode, although I am not sure. – YoYo Apr 04 '17 at 02:33
  • Thank you for the heads up. Is there any way to prioritize only 1 jar, that definitely exists in the server in an older version but you cannot physically replace and you know you do not want to use? – Stanislav Apr 04 '17 at 14:49
  • 1
    I guess in that case you could try exactly as you suggested. Didn't say it was an absolute no. Also mind that the option is marked as 'experimental' - a warning to be heeded! There is no safe way of prioritizing one version of a library over another. In some implementations, this is solved by moving one of the libraries in a different namespace, so you can use both versions at the same time. – YoYo Apr 04 '17 at 18:01
4

There is a restriction on using --jars: if you want to specify a directory for the location of JAR/XML files, it doesn't allow directory expansion. This means you need to specify the absolute path for each JAR file.
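
One common workaround is to let the shell expand the directory and build the comma-separated list for you; a sketch, assuming the JAR files live under a hypothetical /opt/app/lib directory:

JARS=$(echo /opt/app/lib/*.jar | tr ' ' ',')
spark-submit --jars "$JARS" --class MyClass main-application.jar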

If you specify --driver-class-path and you are executing in yarn cluster mode, then the driver classpath doesn't get updated. You can verify whether the classpath was updated or not in the Spark UI or the Spark history server, under the Environment tab.

The option which worked for me to pass JAR files which contain directory expansions, and which worked in yarn cluster mode, was the --conf option. It's better to pass the driver and executor classpaths as --conf, which adds them to the Spark session object itself, and those paths are reflected in the Spark configuration. But please make sure to put the JAR files on the same path across the cluster.

spark-submit \
  --master yarn \
  --queue spark_queue \
  --deploy-mode cluster    \
  --num-executors 12 \
  --executor-memory 4g \
  --driver-memory 8g \
  --executor-cores 4 \
  --conf spark.ui.enabled=False \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapred.output.dir=/tmp \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp
Tanveer
3

Other configurable Spark options relating to JAR files and the classpath, in the case of YARN as the deploy mode, are as follows.

From the Spark documentation,

spark.yarn.jars

List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive

An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

Users can configure this parameter to specify their JAR files, which in turn get included in the Spark driver's classpath.
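
For example, a hedged sketch that packages the locally installed Spark JARs, uploads the archive to HDFS, and points spark.yarn.archive at it (all paths are hypothetical):

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /spark/archives
hdfs dfs -put spark-libs.jar /spark/archives/
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///spark/archives/spark-libs.jar \
  --class MyClass main-application.jar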

DaRkMaN
2

When using spark-submit with --master yarn-cluster, the application JAR file, along with any JAR files included with the --jars option, will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths.

Example:

spark-submit --master yarn-cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

Reference

Submitting Applications

Shiva Garg
1

I know that adding a JAR with the --jars option automatically adds it to the classpath as well.

https://spark.apache.org/docs/3.2.1/submitting-applications.html

That list is included in the driver and executor classpaths.

Heedo Lee
0

When we submit Apache Spark jobs using the spark-submit utility, there is an option, --jars. Using this option, we can pass JAR files to Spark applications.

bala
  • 2
    That there is this `--jars` option was mentioned by the original poster, plus discussed in much more detail by more than one answer. It does not appear that you are providing anything new? – YoYo May 20 '19 at 02:22