
When a Spark job needs a jar file, the jar can be added to the job in two ways:
1. The --jars path option on the command line.
2. SparkContext.addJar("path").
Can anyone tell me the difference between these two ways?
From this question, the answer is that they are identical and only the priority differs, but I don't think that is true. If I submit the Spark job in yarn-cluster mode, addJar() will not work unless the jar files are also included in the --jars option on the command line, according to the official site:

The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in yarn-cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.

The reason is that the driver runs on a different machine than the client. So it seems that the --jars option picks up jars on the client, while addJar() can only work on jars that are already visible to the driver.
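For example, a submission in yarn-cluster mode would look something like this (the paths and class name are hypothetical):

spark-submit --master yarn --deploy-mode cluster \
  --jars /local/path/to/mylib.jar \
  --class com.example.Main \
  app.jar

Here --jars ships the local jar along with the application, so that, per the documentation quoted above, addJar() can then work with that local file on the driver.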

Then I did a test in local mode.

1. spark-shell --master local --jars path/to/jar

If I start spark-shell this way, objects in the jar can be used in the spark-shell.

2. spark-shell --master local

If I start spark-shell this way and call sc.addJar("path/to/jar"), objects within the jar file cannot be imported into the spark-shell, and I get a class-not-found error.
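A minimal reproduction, assuming the jar contains a class com.example.Util (both the path and the class name are hypothetical) — the import fails with something like:

scala> sc.addJar("path/to/mylib.jar")
scala> import com.example.Util
<console>:25: error: object example is not a member of package com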

My questions are:

Why does the method SparkContext.addJar() not work in local mode?

What is the difference between SparkContext.addJar() and --jars?

My environment: Hortonworks 2.5 cluster, Spark 1.6.2. I would appreciate it if anyone could shed some light on this.


1 Answer


Well, after some research, I found the reason. I'm posting it here in case someone else runs into this problem.

The method addJar() does not add jars to the driver's classpath. What it does is locate the jars on the driver node, distribute them to the worker nodes, and add them to the executors' classpath.
Because I submit my Spark job in local mode, the driver's classpath (I guess) is the one used by the job, so the jars added by addJar() cannot be found.
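To illustrate the distinction, here is a sketch for a cluster deployment (the jar path and the class com.example.Util are hypothetical): a class resolved inside a task goes through the executor classloader, which does see jars added with addJar(), while the driver does not.

sc.addJar("path/to/mylib.jar")

// Fails in the driver: the jar is not on the driver's classpath.
// import com.example.Util

// Should work on executors in cluster mode: tasks resolve the class
// through the executor classloader, which includes added jars.
sc.parallelize(1 to 2).map { _ =>
  Class.forName("com.example.Util").getName
}.collect()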

To solve this problem, use the --jars option to include all jars when submitting the Spark job, or use --driver-class-path to add them.
More details can be found here.
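To illustrate, either of the following should make the jar's classes visible in the shell (the path is hypothetical):

spark-shell --master local --jars path/to/mylib.jar
spark-shell --master local --driver-class-path path/to/mylib.jar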
