
I'm working with the latest sbt, sbt.version=1.5.7.

My assembly.sbt contains nothing more than addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0").
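For reference, both of these live under the build's project/ directory:

project/build.properties:

sbt.version=1.5.7

project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")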

I have to work with subprojects due to project requirements.

I am facing the issue of Spark dependencies in "provided" scope, similar to this post: How to work efficiently with SBT, Spark and "provided" dependencies?

As that post describes, I can manage to Compile / run under the root project, but it fails when I Compile / run in the subproject.
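Concretely, the two invocations from the sbt shell look roughly like this (project IDs as defined in the build.sbt below):

analyticsFrameless / Compile / run     // works
impressionModelEtl / Compile / run     // fails with the error shown further down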

Here are the details of my build.sbt:

val deps = Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-core" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.1.2" % "provided",
  "org.apache.spark" %% "spark-avro" % "3.1.2" % "provided",
)

val analyticsFrameless =
  (project in file("."))
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))
    .settings(libraryDependencies ++= deps)

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))
    // .dependsOn(analytics)
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava" % "guava" % "30.1.1-jre",
        "io.delta" %% "delta-core" % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

impressionModelEtl / Compile / run := Defaults
  .runTask(
    impressionModelEtl / Compile / fullClasspath,
    impressionModelEtl / Compile / run / mainClass,
    impressionModelEtl / Compile / run / runner
  )
  .evaluated

When I execute impressionModelEtl / Compile / run with this simple program:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkRead {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession
        .builder()
        .master("local[*]")
        .appName("SparkReadTestProvidedScope")
        .getOrCreate()
    spark.stop()
  }
}

it returns:

[error] java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
[error]         at SparkRead$.main(SparkRead.scala:7)
[error]         at SparkRead.main(SparkRead.scala)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
[error]         at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)

This has baffled me for days. Please help me out... Thanks so much.

SkyOne

2 Answers


Please try adding dependsOn:

val analyticsFrameless =
  (project in file("."))
    .dependsOn(sqlChoreography, impressionModelEtl)
    .aggregate(sqlChoreography, impressionModelEtl)
    .settings(
      libraryDependencies ++= deps
    )

If you are using shared test classes, also add:

.dependsOn(sqlChoreography % "compile->compile;test->test",
           impressionModelEtl % "compile->compile;test->test")
  • Hi Anton, thanks for the reply... I tried both ways of `dependsOn` but it still fails... And I actually can run Spark code in the root project (which is `analyticsFrameless`) but it fails in the subproject (which is `impressionModelEtl`) – SkyOne Dec 27 '21 at 08:47
  • I think there is a workaround for you: remove the `Provided` scope and exclude those dependencies in the `assemblyMergeStrategy` step – Anton Solovev Dec 28 '21 at 17:40
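For reference, a minimal sketch of the workaround suggested in the comment above: drop the "provided" scope so Spark stays on the compile/run classpath, and exclude the Spark jars when building the fat jar. Whole-jar exclusion is typically done with assembly / assemblyExcludedJars rather than assemblyMergeStrategy, and the name-prefix filter below is an assumption:

// Spark stays on the compile and run classpath (no "provided" scope)...
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-sql"  % "3.1.2"
)

// ...but its jars are dropped from the fat jar at assembly time.
assembly / assemblyExcludedJars := {
  val cp = (assembly / fullClasspath).value
  // assumption: matching on the jar file name prefix catches all Spark artifacts
  cp.filter(_.data.getName.startsWith("spark-"))
}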

Finally I figured out a solution: split the parent project's build.sbt so that each subproject gets its own build.sbt file.

For example, in ./build.sbt:

import Dependencies._
ThisBuild / trackInternalDependencies := TrackLevel.TrackIfMissing
ThisBuild / exportJars                := true
ThisBuild / scalaVersion              := "2.12.12"
ThisBuild / version                   := "0.0.1"

ThisBuild / Test / parallelExecution := false
ThisBuild / Test / fork              := true
ThisBuild / Test / javaOptions ++= Seq(
  "-Xms512M",
  "-Xmx2048M",
  "-XX:MaxPermSize=2048M",
  "-XX:+CMSClassUnloadingEnabled"
)

val analyticsFrameless =
  (project in file("."))
    // .dependsOn(sqlChoreography % "compile->compile;test->test", impressionModelEtl % "compile->compile;test->test")
    .settings(
      libraryDependencies ++= deps
    )

lazy val sqlChoreography =
  (project in file("sql-choreography"))

lazy val impressionModelEtl =
  (project in file("impression-model-etl"))

Then, in the impression-model-etl directory, create another build.sbt file:

import Dependencies._

lazy val impressionModelEtl =
  (project in file("."))
    .settings(
      libraryDependencies ++= deps ++ Seq(
        "com.google.guava"            % "guava"         % "30.1.1-jre",
        "io.delta"                   %% "delta-core"    % "1.0.0",
        "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.3"
      )
      // , assembly / assemblyExcludedJars := {
      //   val cp = (assembly / fullClasspath).value
      //   cp filter { _.data.getName == "org.apache.spark" }
      // }
    )

Compile / run := Defaults
  .runTask(
    Compile / fullClasspath,
    Compile / run / mainClass,
    Compile / run / runner
  )
  .evaluated

assembly / assemblyOption := (assembly / assemblyOption).value.withIncludeBin(false)

assembly / assemblyJarName := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"

name := "impression"
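With these settings (plus the shared Dependencies.scala described next), something like the following should work from an sbt shell started at the build root; the exact invocations are a sketch based on the project IDs above:

impressionModelEtl / Compile / run    // local run; run now uses Compile / fullClasspath, which includes the provided Spark jars
impressionModelEtl / assembly         // fat jar without the provided Spark dependencies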

And be sure to extract the shared Spark dependencies into a Dependencies.scala file under the parent build's project/ directory:

import sbt._

object Dependencies {
  // Versions
  lazy val sparkVersion = "3.1.2"

  val deps = Seq(
    "org.apache.spark"       %% "spark-sql"                        % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-core"                       % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-mllib"                      % sparkVersion             % "provided",
    "org.apache.spark"       %% "spark-avro"                       % sparkVersion             % "provided",
    ...
  )
}
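Roughly, the resulting layout looks like this (placing the plugin file and Dependencies.scala under the root project/ directory is an assumption based on the imports above):

.
├── build.sbt                      // root: ThisBuild settings, project declarations
├── project/
│   ├── Dependencies.scala         // shared "provided" Spark dependencies
│   ├── assembly.sbt               // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")
│   └── build.properties           // sbt.version=1.5.7
├── sql-choreography/
└── impression-model-etl/
    └── build.sbt                  // subproject deps, run and assembly settings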

With all these steps done, Spark code can be run locally from the subproject while the Spark dependencies stay in "provided" scope.
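Since the Spark jars are excluded from the fat jar, the assembled artifact is meant to be launched with spark-submit, which provides Spark at runtime. A sketch, assuming the SparkRead object from the question and the jar name produced by the assemblyJarName setting above (the output path is the sbt default):

spark-submit \
  --class SparkRead \
  --master "local[*]" \
  impression-model-etl/target/scala-2.12/impression_2.12-3.1.2_0.0.1.jar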

SkyOne