
I'm building an Apache Spark application in Scala and I'm using SBT to build it. Here is the thing:

  1. when I'm developing under IntelliJ IDEA, I want Spark dependencies to be included in the classpath (I'm launching a regular application with a main class)
  2. when I package the application (thanks to the sbt-assembly plugin), I do not want Spark dependencies to be included in my fat JAR
  3. when I run unit tests through sbt test, I want Spark dependencies to be included in the classpath (same as #1, but from SBT)

To match constraint #2, I'm declaring Spark dependencies as provided:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  ...
)

Then, sbt-assembly's documentation suggests adding the following line to include the dependencies for unit tests (constraint #3):

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

That leaves constraint #1 unfulfilled, i.e. I cannot run the application in IntelliJ IDEA, as the Spark dependencies are not picked up.

With Maven, I was using a specific profile to build the uber JAR. That way, I was declaring Spark dependencies as regular dependencies for the main profile (IDE and unit tests) while declaring them as provided for the fat JAR packaging. See https://github.com/aseigneurin/kafka-sandbox/blob/master/pom.xml

What is the best way to achieve this with SBT?

Sean Glover
Alexis Seigneurin
  • Just a reminder: when using `spark-submit`, Spark will use the libraries from the Spark installation path, not the ones you packed into your assembly JAR, unless you specifically tell it to (there is a config setting for that). Quoting the Spark documentation: "Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar." – linehrr May 01 '19 at 15:22

8 Answers

25

Use the new 'Include dependencies with "Provided" scope' option in an IntelliJ run configuration.

IntelliJ config with Provided scope checkbox

Martin Tapp
19

(Answering my own question with an answer I got from another channel...)

To be able to run the Spark application from IntelliJ IDEA, you simply have to create a main class in the src/test/scala directory (test, not main). IntelliJ will pick up the provided dependencies.

object Launch {
  // Delegates to the real main class; because this lives under src/test/scala,
  // IntelliJ puts the "provided" dependencies on the classpath when running it.
  def main(args: Array[String]): Unit = {
    Main.main(args)
  }
}

Thanks to Matthieu Blanc for pointing that out.

Will Humphreys
Alexis Seigneurin
  • Could you clarify why it picks up the dependencies this way? I have the same problem: I'm trying to run Spark locally from IDEA and package it using sbt assembly, and currently I have to manually add "provided" to the sbt build file for the latter case, otherwise IDEA won't pick up these dependencies. – lizarisk Oct 05 '16 at 12:14
  • Did you still add this line to build.sbt? `run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))` – autodidacticon Sep 27 '17 at 15:43
  • I didn't need to add it to run the test. – yoneal Apr 05 '18 at 15:26
  • How about telling sbt to use `provided` dependencies for plain mains as well, not just mains under `test`, so that the scenario is solved in the codebase itself rather than in IntelliJ? I wonder whether sbt supports that and whether IntelliJ, in turn, knows to import that kind of definition from sbt. – matanster Sep 02 '19 at 05:20
  • One suggestion for doing that is linked from [here](https://github.com/sbt/sbt-assembly#-provided-configuration) if it helps anyone. – matanster Sep 02 '19 at 05:29
4

You need to make this work in IntelliJ.

The main trick here is to create another subproject that will depend on the main subproject and will have all its provided libraries in compile scope. To do this I add the following lines to build.sbt:

lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
  libraryDependencies ++= spark.map(_ % "compile")
)
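
For context, `spark` referenced above is presumably just the sequence of Spark modules declared as provided on the root project. A minimal sketch of that part of build.sbt (the module list and version are assumptions, not taken from the answer):

val sparkVersion = "2.4.0"  // assumed version

lazy val spark = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)

// the root project keeps Spark in "provided" scope so it stays out of the fat JAR
libraryDependencies ++= spark.map(_ % "provided")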

Now I refresh the project in IDEA and slightly change the previous run configuration so that it uses the new mainRunner module's classpath:

IntelliJ run configuration using the mainRunner module's classpath

Works flawlessly for me.

Source: https://github.com/JetBrains/intellij-scala/wiki/%5BSBT%5D-How-to-use-provided-libraries-in-run-configurations

Atais
  • I also had to add `assembly := new File("")` to the `mainRunner` settings so the `sbt assembly` command wouldn't try to run on the `mainRunner` project. – josephpconley May 05 '17 at 12:58
  • When you say "create another subproject" what does this mean? Do I create a subfolder with a build.sbt in it? If I put that code in my main build.sbt I get `java.lang.IllegalArgumentException: requirement failed: Configurations already specified for module org.apache.spark:spark-core:2.1.1:provided` as an error. – Ash Berlin-Taylor Oct 05 '17 at 10:32
  • The solution works well only after setting the `scalaVersion` to be the same as declared at the top of the `build.sbt` file – Haimke Nov 07 '18 at 08:36
3

For running Spark jobs, the general solution of "provided" dependencies works: https://stackoverflow.com/a/21803413/1091436

You can then run the app from sbt, IntelliJ IDEA, or anything else.

It basically boils down to this:

run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in (Compile, run)).evaluated
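
In context, a minimal build.sbt sketch that combines these run settings with the "provided" Spark dependencies could look like the following (the module list and version are assumptions):

val sparkVersion = "2.4.0"  // assumed version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)

// make `sbt run` / `sbt runMain` use the full classpath, including "provided" dependencies
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in (Compile, run)).evaluated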

VasiliNovikov
  • I face the same problem and this approach can work. However, when my code is in a subproject, for example in the "./subproject1" folder, the annoying "NoClassDef" exception occurs again. How can I adapt the `Compile / run` settings to cope with subprojects? – SkyOne Dec 26 '21 at 02:29
2

A solution based on creating another subproject for running the project locally is described here.

Basically, you would need to modify the build.sbt file with the following:

lazy val sparkDependencies = Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)

libraryDependencies ++= sparkDependencies.map(_ % "provided")

lazy val localRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
  libraryDependencies ++= sparkDependencies.map(_ % "compile")
)

Then run the new subproject locally by selecting Use classpath of module: localRunner in the Run Configuration.

bertslike
2

[Obsolete] See the newer answer, "Use the new 'Include dependencies with "Provided" scope' option in an IntelliJ run configuration".

The easiest way to add provided dependencies to debug a task with IntelliJ is to:

  • Right-click src/main/scala
  • Select Mark Directory as... > Test Sources Root

This tells IntelliJ to treat src/main/scala as a test folder, which makes it add all dependencies tagged as provided to any run config (debug/run).

Every time you do an SBT refresh, redo these steps, as IntelliJ will reset the folder to a regular source folder.

Martin Tapp
  • It is not a good solution, as any change to the SBT file overrides the manual changes to the project. After a change in the SBT file, `src/main/scala` goes back to being a `Sources Root` – Haimke Nov 07 '18 at 08:13
  • There is a new checkbox called 'Include dependencies with "Provided" scope' in a run configuration, making my solution no longer required! – Martin Tapp Nov 07 '18 at 13:25
0

You should not be looking at SBT for an IDEA-specific setting. First of all, if the program is supposed to be run with spark-submit, how are you running it in IDEA? I am guessing you run it standalone in IDEA, while normally running it through spark-submit. If that's the case, manually add the Spark libraries in IDEA using File | Project Structure | Libraries. You'll see all the dependencies listed from SBT, but you can add arbitrary jar/Maven artifacts using the + (plus) sign. That should do the trick.

Roberto Congiu
-2

Why not bypass sbt and manually add spark-core and spark-streaming as libraries to your module dependencies?

  • Open the Project Structure dialog (e.g. ⌘;).
  • In the left-hand pane of the dialog, select Modules.
  • In the pane to the right, select the module of interest.
  • In the right-hand part of the dialog, on the Module page, select the Dependencies tab.
  • On the Dependencies tab, click Add and select Library.
  • In the Choose Libraries dialog, select New Library > From Maven.
  • Find spark-core, e.g. org.apache.spark:spark-core_2.10:1.6.1.
  • Profit

https://www.jetbrains.com/help/idea/2016.1/configuring-module-dependencies-and-libraries.html?origin=old_help#add_existing_lib

Jean-Marc S.