
I have a fairly basic question about how to most effectively work with a local Spark environment alongside a remote server deployment. Despite all the various pieces of information about this, I still haven't found any of them very clear.

I have my IntelliJ environment set up with the dependencies I need in my pom so that I can compile, run, and test locally within IntelliJ. Then I want to test and run against a remote server by copying my packaged jar file over via scp and running spark-submit there.

But I don't need any of those Maven dependencies packaged into the jar, since spark-submit will just use the software already on the server anyway. Really I just need a jar file with my own classes, and keeping it lightweight for the scp would be best. I may be misunderstanding this, but for now I just need to figure out how to exclude the dependencies from being added to the jar during packaging. What is the right way to do that?

Update: I managed to create a jar with and without dependencies using the plugin below, and I could just upload the dependency-free jar to the server after each build. But how can I build only the jar without dependencies, rather than also waiting for the larger jar with everything in it, which I don't need anyway?

    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.0.0</version>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
horatio1701d
  • Creating a shaded jar file for Spark is correct. If the libraries are on the server, use the `provided` scope on the pom dependency. Oh, and Maven can execute scp with a custom task – OneCricketeer Jun 10 '17 at 17:49
  • Possible duplicate of [Difference between maven scope compile and provided for JAR packaging](https://stackoverflow.com/questions/6646959/difference-between-maven-scope-compile-and-provided-for-jar-packaging) – OneCricketeer Jun 10 '17 at 17:52
  • Thanks, but I can't use `provided` scope since I need the libraries to compile and test locally within the IDE. Does that make sense? Can you elaborate on exactly how to go about shading all of the Spark libraries? Will definitely look into wagon-ssh; didn't know about that. – horatio1701d Jun 10 '17 at 17:55
  • The `provided` scope means they are not packaged. They still need to be downloaded locally for you to compile the code. – OneCricketeer Jun 10 '17 at 17:56
  • 1
    Thanks. so just to be clear. I have to shade all of the spark libraries and also use `provided` scope? – horatio1701d Jun 10 '17 at 17:59
  • Regarding the other question: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html – OneCricketeer Jun 10 '17 at 17:59

1 Answer


Two things here.

The `provided` dependency scope will let you compile and test locally while keeping the server-provided libraries out of the packaged jar.
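
For instance, a minimal sketch of what that looks like in the pom (the artifact and version here are assumptions; match them to whatever Spark build actually runs on your cluster):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.1</version>
        <!-- available on the compile/test classpath locally,
             but not packaged into the jar you ship -->
        <scope>provided</scope>
    </dependency>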

Maven doesn't package external libraries into your jar at all unless you build an uber (shaded) jar, so a plain `mvn package` already gives you the lightweight jar you want.
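
If you ever do need to bundle third-party libraries that are *not* on the cluster, the usual route is the maven-shade-plugin; a rough sketch (version chosen as an example). Anything marked `provided` is left out of the shaded jar automatically, so this works together with the scope above:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.0.0</version>
        <executions>
            <execution>
                <!-- build the shaded jar as part of mvn package -->
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
            </execution>
        </executions>
    </plugin>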

An example of a good Spark POM is provided by Databricks

Also worth mentioning: Maven copy local file to remote server using SSH.

See the Maven Wagon SSH plugin.
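
Roughly, wagon-ssh is wired in as a build extension so Maven can talk to `scp://` URLs, and then `mvn deploy` pushes the jar to the server. The host, path, and repository id below are placeholders; the matching SSH credentials go in a `<server>` entry with the same id in your settings.xml:

    <build>
        <extensions>
            <extension>
                <groupId>org.apache.maven.wagon</groupId>
                <artifactId>wagon-ssh</artifactId>
                <version>2.12</version>
            </extension>
        </extensions>
    </build>

    <distributionManagement>
        <repository>
            <!-- id must match a <server> entry in settings.xml -->
            <id>spark-edge-node</id>
            <url>scp://spark-edge.example.com/home/me/jobs</url>
        </repository>
    </distributionManagement>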

OneCricketeer
  • I know I could use my own Spark libraries when working locally and point IntelliJ at those jars, but is that the only way to compile with `provided` scope? Because when I switch my Spark libraries to `provided` and try to run within IntelliJ, I get a class-not-found error: `Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$`, unless I add my own Spark jars as dependencies. – horatio1701d Jun 11 '17 at 12:17
  • You would only get that error if you copied the pom I linked to and didn't check that it's using Spark 1, whereas your error is from Spark 2 – OneCricketeer Jun 11 '17 at 12:20
  • I definitely did not copy over the pom from Databricks. So how do I tell IntelliJ to use the downloaded Spark jar files if the scope is `provided`? – horatio1701d Jun 11 '17 at 12:24
  • It already does... I have it in my pom files and it works. As I explained, `provided` only affects the runtime of the code, not the compilation – OneCricketeer Jun 11 '17 at 12:25
  • Are you trying to run the code in IntelliJ? In that case, yes, you're going to have missing classes – OneCricketeer Jun 11 '17 at 12:26
  • Yes, trying to compile and run in IntelliJ. So it looks like I have to add my own Spark jars to IntelliJ, keep everything I don't want packaged into the uber jar as `provided`, and then I can test locally. I was wondering if there was a way to avoid using my own Spark installation, but I guess that makes sense. I wasn't sure what the best workflow is when testing locally with IntelliJ and then needing to package/scp to a remote cluster. – horatio1701d Jun 11 '17 at 12:30
  • 1
    You could define two separate pom files I believe, though never done it. One for local development and the other for remote – OneCricketeer Jun 11 '17 at 12:39
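
A sketch (not from the thread) of one way to get the two-configuration effect from a single pom: put the Spark scope behind a property and flip it with a profile. The property name `spark.scope`, the profile id, and the Spark artifact/version are all illustrative choices:

    <properties>
        <!-- default: compile scope so the code runs inside the IDE -->
        <spark.scope>compile</spark.scope>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
            <scope>${spark.scope}</scope>
        </dependency>
    </dependencies>

    <profiles>
        <profile>
            <!-- activate with: mvn package -Pcluster -->
            <id>cluster</id>
            <properties>
                <spark.scope>provided</spark.scope>
            </properties>
        </profile>
    </profiles>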