When a jar file is needed in a Spark job, it can be added to the job in one of two ways:

1. the `--jars path` option on the command line, or
2. `SparkContext.addJar("path")` in the code.
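For concreteness, here is a minimal sketch of both approaches (the paths, class name, and app name are placeholders):

```scala
// Way 1: ship the dependency jar on the command line when submitting:
//   spark-submit --class com.example.Main --master yarn-cluster \
//     --jars /local/path/dep.jar myapp.jar

// Way 2: add the jar programmatically from the driver code:
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addJar-demo"))
    // ask Spark to make the jar available to executors for task execution
    sc.addJar("/local/path/dep.jar")
    // ... job logic ...
    sc.stop()
  }
}
```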
Can anyone tell me the difference between these two ways?
From this question, the answer is that they are identical and only the priority differs, but I don't think that is true. If I submit the Spark job in yarn-cluster mode, `addJar()` will not work when the jar files are not also included in the `--jars` option on the command line, according to the official site:
> The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in yarn-cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
The reason is that the driver runs on a different machine than the client, so it seems that the `--jars` option is resolved on the client side, while `addJar()` can only work with jars that the driver itself can reach.
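To make that concrete, here is my reading of the quote above as a sketch (the paths are hypothetical, and this is only how I currently understand it):

```scala
// In yarn-cluster mode the driver runs on a cluster node, not on the client machine,
// so a path that exists only on the client is not visible to the driver:
sc.addJar("/home/me/dep.jar")      // client-local path: apparently only works if the same
                                   // jar was also shipped via --jars

// whereas a path that any node can resolve should not need --jars at all:
sc.addJar("hdfs:///libs/dep.jar")  // HDFS (or HTTP/HTTPS/FTP) path, per the docs
```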
Then I did a test in local mode.

1. `spark-shell --master local --jars path/to/jar`
   If I start spark-shell this way, objects in the jar can be used in the spark-shell.
2. `spark-shell --master local`
   If I start spark-shell this way and then call `sc.addJar("path/to/jar")`, objects in the jar cannot be imported into the spark-shell and I get a "class cannot be found" error (reproduced in the sketch below).
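Here is the whole reproduction in one place; `com.example.Dep` is just a placeholder for a class inside my jar:

```scala
// Case 1: spark-shell --master local --jars path/to/dep.jar
//   scala> import com.example.Dep     // works: the class resolves in the shell

// Case 2: spark-shell --master local
//   scala> sc.addJar("path/to/dep.jar")
//   scala> import com.example.Dep     // fails with a class-cannot-be-found error
```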
My questions are:

1. Why does the method `SparkContext.addJar()` not work in local mode?
2. What is the difference between `SparkContext.addJar()` and `--jars`?
My environment: a Hortonworks 2.5 cluster, and the Spark version is 1.6.2. I would appreciate it if anyone could shed some light on this.