
I want to set up a Mill build that lets me develop and run a Spark job locally, either via SparkSample.run or as a full fat JAR for local tests. At some point I'd also like to ship it as a filtered assembly (i.e. without the Spark libraries, but with all project libraries) to a cluster with a running Spark context.

I currently use this build.sc:

import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule {
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyBaseDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  // STANDALONE APP
  def ivyDeps = ivyBaseDeps ++ ivySparkDeps

  // REMOTE SPARK CLUSTER
  // def ivyDeps = ivyBaseDeps
  // def compileIvyDeps = ivySparkDeps
  // def assemblyRules =
  //   Assembly.defaultRules ++
  //     Seq(
  //       "scala/.*",
  //       "org.slf4j.*",
  //       "org.apache.log4j.*"
  //     ).map(Assembly.Rule.ExcludePattern.apply)
}

For running and building a full fat jar, I keep it as is.

For creating a filtered assembly, I comment out the ivyDeps line under "STANDALONE APP" and uncomment everything below "REMOTE SPARK CLUSTER".

Editing the build file for every new task felt inelegant, so I tried adding a separate task to build.sc:

  def assembly2 = T {
    def ivyDeps = ivyBaseDeps
    def compileIvyDeps = ivySparkDeps
    def assemblyRules =
      Assembly.defaultRules ++
        Seq(
          "scala/.*",
          "org.slf4j.*",
          "org.apache.log4j.*"
        ).map(Assembly.Rule.ExcludePattern.apply)
    super.assembly
  }

but when I run SparkSample.assembly2 it still produces a full assembly, not a filtered one. It seems that overriding ivyDeps et al. inside a task does not work.

Is this possible in Mill?

Bernhard

1 Answer


You can't override defs inside a task. Locally defining some ivyDeps and compileIvyDeps will not magically make super.assembly use them.

Of course, you could create such a task by looking at how super.assembly is defined in JavaModule, but you would end up copying and adapting a lot more targets (upstreamAssembly, upstreamAssemblyClasspath, transitiveLocalClasspath, and so on) and making your build file hard to read.

A better way is to make the lighter dependencies and assembly rules the default, and move the creation of the standalone JAR into a sub-module:

import mill._, scalalib._
import mill.modules.Assembly

object SparkSample extends ScalaModule { outer =>
  def scalaVersion = "2.12.10"
  def scalacOptions =
    Seq("-encoding", "utf-8", "-explaintypes", "-feature", "-deprecation")

  def ivySparkDeps = Agg(
    ivy"org.apache.spark::spark-sql:2.4.5"
      .exclude("org.slf4j" -> "slf4j-log4j12"),
    ivy"org.slf4j:slf4j-api:1.7.16",
    ivy"org.slf4j:slf4j-log4j12:1.7.16"
  )

  def ivyDeps = Agg(
    ivy"com.lihaoyi::upickle:0.9.7"
  )

  def compileIvyDeps = ivySparkDeps

  def assemblyRules =
    Assembly.defaultRules ++
      Seq(
        "scala/.*",
        "org.slf4j.*",
        "org.apache.log4j.*"
      ).map(Assembly.Rule.ExcludePattern.apply)

  object standalone extends ScalaModule {
    def scalaVersion = outer.scalaVersion
    def moduleDeps = Seq(outer)
    def ivyDeps = outer.ivySparkDeps
  }
}

To create a Spark Cluster JAR run: mill SparkSample.assembly

To create a standalone JAR run: mill SparkSample.standalone.assembly

To create both, simply run: mill __.assembly

Tobias Roeser
  • I had to add "def finalMainClass = T { 'com.example.SparkSample' }" to "standalone" to allow to run "mill SparkSample.standalone.run". (Now I need to figure out how mill would add the main class to the standalone.assembly manifest file, then all works as I wanted) – Bernhard Feb 24 '20 at 18:57
  • That's because `finalMainClass` is not used for `assembly`. If you use `mainClass` (or `finalMainClassOpt`) it works as expected. – Tobias Roeser Feb 25 '20 at 08:34
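Following the comment exchange above, a minimal sketch of the standalone sub-module with mainClass set, so that both run and the assembly manifest pick up the entry point (the class name com.example.SparkSample is taken from the comment and is just an example; adjust it to your project):

    object standalone extends ScalaModule {
      def scalaVersion = outer.scalaVersion
      def moduleDeps = Seq(outer)
      def ivyDeps = outer.ivySparkDeps
      // mainClass (an Option[String]) is consulted by both run and assembly,
      // so the generated JAR's manifest contains the Main-Class entry;
      // finalMainClass is only used for run, not for the assembly manifest.
      def mainClass = Some("com.example.SparkSample")
    }

With this in place, mill SparkSample.standalone.run and java -jar on the standalone assembly should both resolve the same entry point.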