20

I'm trying to submit a Spark app from my local machine's terminal to my cluster. I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine.

When I provide the path to the application jar, which is on my local machine, will spark-submit automatically upload it to my cluster?

I'm using

    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster \
      --executor-memory 100m \
      --num-executors 50 \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000

and I am getting the error:

Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist

In the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit, under Advanced Dependency Management, it says:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.

But it seems like it does not!

nish1013
  • What is the full stack trace of the exception? – vanza Dec 31 '15 at 05:50
  • Basically, `spark-submit` does not upload the file to the cluster (yarn or k8s) as it's out of Spark's control (before the Spark driver is started). `--jars` transfers jars from the driver to the executors only (after the driver is started). – Leon May 02 '19 at 02:33

3 Answers

17

I see you are quoting the spark-submit page from the Spark docs, but I would spend a lot more time on the Running Spark on YARN page. Bottom line, look at:

There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
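
To make the difference concrete, here is a minimal sketch of both invocations, reusing the class and jar names from your question (only the --master value changes; where the driver runs is the whole difference):

    # yarn-cluster: the driver runs in a YARN application master on the cluster
    bin/spark-submit --master yarn-cluster --class com.my.application.XApp \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000

    # yarn-client: the driver runs locally, inside the spark-submit process
    bin/spark-submit --master yarn-client --class com.my.application.XApp \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000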

Further, you note: "I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine."

So I agree with you: you are right to run --master yarn-cluster instead of --master yarn-client.

(And one comment notes what might just be a syntax error, where you dropped "assembly.jar", but I think this will apply as well...)

Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.

From an email on the Apache Spark User list:

YARN cluster mode. Spark submit does upload your jars to the cluster. In particular, it puts the jars in HDFS so your driver can just read from there. As in other deployments, the executors pull the jars from the driver.

So finally, from the Apache Spark YARN doc:

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
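
As a sketch, assuming your Hadoop client configuration lives under /etc/hadoop/conf (the exact directory depends on your distribution):

    # point spark-submit at the Hadoop client configs before submitting
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    bin/spark-submit --master yarn-cluster --class com.my.application.XApp \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000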


NOTE: I only see you adding a single JAR; if there's a need to add other JARs, there's a special note about doing that with YARN:

In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

That page in the link has some examples.
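
For example, a minimal sketch with two extra dependency jars (the dep1/dep2 names are hypothetical; --jars takes a comma-separated list):

    bin/spark-submit --master yarn-cluster --class com.my.application.XApp \
      --jars /Users/nish1013/libs/dep1.jar,/Users/nish1013/libs/dep2.jar \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000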


And of course you downloaded or built the YARN-specific version of Spark.


Background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
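
For completeness, a sketch of that standalone invocation (the master host and jar path here are hypothetical; the jar must already exist at the same path on every worker):

    # standalone cluster mode: any worker may be picked to host the driver,
    # so the jar has to be readable at this path on all of them
    spark-submit --master spark://master-host:7077 --deploy-mode cluster \
      --class com.my.application.XApp /shared/jars/x-service-assembly.jar 1000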

But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or make them "globally available" via HDFS, for another reason from the docs:

The local: scheme, from Advanced Dependency Management, seems to present the best of both worlds, but it is also a great reason for manually pushing your jars out to all nodes:

local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

But I assume that local:/... would change to hdfs:/ ... not sure on that one.
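
If you go that route, a sketch might look like this (the /opt/jars path is hypothetical; the jar must be pre-pushed to that same path on every node):

    # copy the assembly to the same location on each node first (scp, NFS, etc.),
    # then reference it with local: so no network IO is incurred at submit time
    bin/spark-submit --master yarn-cluster --class com.my.application.XApp \
      local:/opt/jars/x-service-1.0.0-201512141101-assembly.jar 1000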

JimLohse
5

Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.

You can find more info on the Submitting Applications page. As you can see, among the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.

application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
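
As a sketch of the hdfs:// variant (the /user/nish1013 target directory is an assumption; adjust to your HDFS layout):

    # stage the assembly on HDFS so every node can resolve it, then submit by URL
    hdfs dfs -put /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar /user/nish1013/
    bin/spark-submit --master yarn-cluster --class com.my.application.XApp \
      hdfs:///user/nish1013/x-service-1.0.0-201512141101-assembly.jar 1000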

As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (i.e., the machine on which you call spark-submit, for example your laptop), without the need to copy the .jar file to the cluster. More about this here.

Markon
  • But I'm receiving an error Diagnostics: java.io.FileNotFoundException: pointing to my jar, even though the jar is available. – nish1013 Dec 21 '15 at 09:09
  • The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. – Markon Dec 21 '15 at 09:11
  • How are you calling the spark-submit command? Could you update your question with the error message + the way you call spark-submit? And other details, if you have (how many nodes, how you setup the cluster, and so on) – Markon Dec 21 '15 at 09:11
  • bin/spark-submit --class com.my.application.XApp --master yarn-cluster --executor-memory 100m --num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000 – nish1013 Dec 21 '15 at 09:12
  • Could you please update your question with the commands / error you got? – Markon Dec 21 '15 at 09:15
  • So I have to copy the jar to all nodes or to HDFS. So that means it won't be copied automatically? – nish1013 Dec 21 '15 at 09:23
  • Yes, exactly. You first need to deploy the application on the cluster. Then, you need to run it with the command you executed. Spark doesn't copy the jar file from your local machine to the cluster. – Markon Dec 21 '15 at 09:25
  • I copied the file to HDFS, but now I'm getting Diagnostics: java.io.FileNotFoundException: File file:/Users/nis1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar does not exist – nish1013 Dec 21 '15 at 10:11
  • You need to use hdfs://path/to/assembly.jar on HDFS when in spark-submit. – Markon Dec 21 '15 at 10:27
  • Yes, but this time it's not finding spark-assembly-1.4.1-hadoop2.6.0.jar, which is my local Spark library jar, which is again already there! (Not the application jar; it does not complain about that anymore.) – nish1013 Dec 21 '15 at 10:31
  • Yes, you need to install the dependencies / spark / hadoop on your cluster. Unless, as I said previously, you use yarn-client (but then you need to setup your local instance as a yarn-client). – Markon Dec 21 '15 at 10:33
  • I added a new question, as it is different from this thread: http://stackoverflow.com/questions/34395519/erro-spark-assembly-1-4-1-hadoop2-6-0-jar-does-not-exist – nish1013 Dec 21 '15 at 12:20
-1

Try adding the --jars option before your /path/to/jar/file:

    spark-submit --jars /tmp/test.jar --class com.my.application.XApp /path/to/jar/file

noorul
  • There is also a --driver-class-path option. Can you try that with colon-separated paths to the jars if you have more than one? – noorul Dec 22 '15 at 04:19
  • 2
    If you look at the error I see incomplete path "Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist" but in the command line you specified /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar – noorul Dec 22 '15 at 04:20