
I have an older version of Spark set up with YARN that I don't want to wipe out, but I still want to use a newer version. I found a couple of posts referring to how a fat jar can be used for this.

Many SO posts point to either Maven (officially supported) or sbt to build a fat jar, because it's not directly available for download. There seem to be multiple plugins to do it with Maven: maven-assembly-plugin, maven-shade-plugin, onejar-maven-plugin, etc.

However, I can't figure out whether I really need a plugin, and if so, which one and how exactly to go about it. I tried directly compiling the GitHub source using 'build/mvn' and 'build/sbt', but the 'spark-assembly_2.11-2.0.2.jar' file is just 283 bytes.

My goal is to run the pyspark shell using the newer version's fat jar, in a similar way to the approach mentioned here.
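
For reference, the Spark 1.x style invocation I have in mind is roughly the following (a sketch only; the assembly path is a placeholder, not my actual setup):

    # Spark 1.x approach: point the shell at a single assembly jar on HDFS
    pyspark --master yarn --conf spark.yarn.jar=hdfs:///path/to/spark-assembly-1.x-hadoop2.x.jar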

noobman

2 Answers


From Spark version 2.0.0, creating a fat jar is no longer supported; you can find more information in Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?

The recommended way in your case (running on YARN) is to create a directory on HDFS with the contents of Spark's jars/ directory and add this path to spark-defaults.conf:

    spark.yarn.jars    hdfs:///path/to/jars/directory/on/hdfs/*.jar

Then, when you run the pyspark shell, it will use the previously uploaded libraries, so it will behave exactly like the fat jar from Spark 1.x.
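
A minimal sketch of those steps, assuming (hypothetically) that the Spark 2.x distribution is unpacked under /opt/spark-2.0.2 and the HDFS target directory is /spark-2.0.2-jars:

    # upload the Spark 2.x jars to HDFS (both paths are placeholders)
    hdfs dfs -mkdir -p /spark-2.0.2-jars
    hdfs dfs -put /opt/spark-2.0.2/jars/*.jar /spark-2.0.2-jars/

    # then add to /opt/spark-2.0.2/conf/spark-defaults.conf:
    #   spark.yarn.jars    hdfs:///spark-2.0.2-jars/*.jar

    # and start the shell from the Spark 2.x distribution
    /opt/spark-2.0.2/bin/pyspark --master yarn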

Mariusz
  • To clarify, I should download the Spark 2.0.2 zip from GitHub, compile it using Maven and put all JARs from 'target/scala-2.11/jars' into an HDFS directory. Then change the spark-defaults.conf in the **Spark 1.x directory** to point to this HDFS dir and run pyspark from there? Or should this 2.x be in its own local directory on the master, with the conf updated to point to the HDFS dir? In the latter case, would there be any additional setup/config needed? Thanks! – noobman Dec 28 '16 at 20:13
  • After downloading Spark, build it into a runnable distribution (http://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution). Then copy the resulting `tgz` file to the master, unpack it into its own directory (independent from Spark 1.x) and copy the jars (from the unpacked tgz) to HDFS. Then change the config, and keep the configuration for every Spark distribution separate. – Mariusz Dec 28 '16 at 20:55
  • I ran `./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` in the directory and got a rather long [shell output](http://pastebin.com/McH86bwE) with no 'dist/' directory after. – noobman Dec 28 '16 at 21:44
  • Is this the complete shell output? What is the exit code from the build? – Mariusz Dec 29 '16 at 08:24
  • Yes, that's the exact shell output. I don't get/see any return code. I'm running this on MacOS and had to set JAVA_HOME to /usr/bin/java to make this run, if that's relevant. – noobman Dec 29 '16 at 10:21
  • It was a java env var issue, and since most commands' output was going to /dev/null it wasn't showing up. I got the tgz file. Trying it out.. – noobman Dec 30 '16 at 04:27
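
Following up on the build discussion in the comments above, a minimal sketch of the distribution build on macOS, assuming JAVA_HOME should point at a JDK home directory rather than at the /usr/bin/java binary:

    # point JAVA_HOME at a real JDK home (macOS); /usr/bin/java is just a stub binary
    export JAVA_HOME=$(/usr/libexec/java_home)

    # same command as in the comment above
    ./dev/make-distribution.sh --name custom-spark --tgz \
      -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn

    # check the exit code before looking for dist/ or the .tgz
    echo "exit code: $?"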

The easiest solution (without changing your Spark-on-YARN architecture or speaking to your YARN admins) is to:

  1. Define a library dependency on Spark 2 in your build system, be it sbt or Maven.

  2. Assemble your Spark application to create a so-called uber-jar or fatjar with Spark libraries inside.

It works and I personally tested it at least once in a project.

The only (?) downside is that the build process takes longer (you have to `sbt assembly`, not just `sbt package`) and the size of your Spark application's deployable fatjar is...well...much bigger. That also makes deployment longer, since you have to spark-submit it to YARN over the wire.

All in all, it works but takes longer (which may still be shorter than convincing your admin gods to, say, forget about what is available in commercial offerings like Cloudera's CDH, Hortonworks' HDP or the MapR distro).
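
A sketch of that workflow, assuming (hypothetically) an sbt project that declares the Spark 2.x dependencies, uses the sbt-assembly plugin, and has a main class com.example.Main; the jar name is a placeholder:

    # build the uber-jar / fatjar (requires the sbt-assembly plugin configured in the project)
    sbt assembly

    # submit the fat jar to YARN; class and jar names are placeholders
    spark-submit --class com.example.Main --master yarn --deploy-mode cluster \
      target/scala-2.11/my-spark-app-assembly-0.1.jar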

Jacek Laskowski
  • I'm not very familiar with the build/library dependency part, as I've primarily used the pyspark shell and spark-submit with .py files. I wouldn't mind the slight delay/larger size, but how would this work for Python? – noobman Dec 28 '16 at 20:27
  • I think you could use `--jars` after you assemble your dependencies together (as one project for dependencies only) along with your `*.py` script. I think it could work (similarly to the answer you pointed out in your question with `--conf spark.yarn.jar=...`). – Jacek Laskowski Dec 29 '16 at 08:15
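
A sketch of how that `--jars` suggestion might look for a Python job, where deps-assembly.jar (a dependencies-only assembly) and job.py are hypothetical names; whether this actually picks up the newer Spark on the older YARN setup is exactly what the comment hedges on:

    # ship a dependencies-only assembly alongside the Python script
    spark-submit --master yarn \
      --jars deps-assembly.jar \
      job.py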