
Scala/JVM noob here who wants to understand more about logging, specifically when using Apache Spark.

I have written a library in Scala that depends upon a bunch of Spark libraries. Here are my dependencies:

import sbt._

object Dependencies {

  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark,
    "org.apache.spark" %% "spark-sql" % Version.spark,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )
  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )
}

(admittedly I copied a lot of this from elsewhere).
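
For completeness, here is roughly how that object gets wired into build.sbt, assuming it lives under project/Dependencies.scala as usual (a simplified sketch; the project name and Scala version below are placeholders rather than my real settings):

lazy val root = (project in file("."))
  .settings(
    name := "my-spark-lib",        // placeholder name
    scalaVersion := "2.11.12",     // Spark 2.2.x is published for Scala 2.11
    // apply the slf4j-log4j12 exclusion to every dependency
    libraryDependencies ++= Dependencies.deps.map(_.excludeAll(Dependencies.exc: _*))
  )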

I am able to package my code as a JAR using sbt package, which I can then call from Spark by placing the JAR into ${SPARK_HOME}/jars. This works great.

I now want to implement logging in my code, so I do this:

import com.typesafe.scalalogging.Logger
/*
 * stuff stuff stuff
 */
  val logger : Logger = Logger("name")
  logger.info("stuff")

However, when I try to call my library (which I'm doing from Python, not that I think that's relevant here), I get an error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.company.package.class.function.
E : java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger$

Clearly this is because the com.typesafe.scala-logging library is not in my JAR. I know I could solve this by packaging with sbt assembly, but I don't want to do that because it will include all the other dependencies and make my JAR enormous.

Is there a way to selectively include libraries (com.typesafe.scala-logging in this case) in my JAR? Alternatively, should I be attempting to log using another method, perhaps using a logger that is included with Spark?


Thanks to pasha701 in the comments I attempted packaging my dependencies using sbt assembly rather than sbt package, this time marking the Spark dependencies as Provided:

import sbt._

object Dependencies {

  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark % Provided,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark % Provided,
    "org.apache.spark" %% "spark-sql" % Version.spark % Provided,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )

  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )

}

Unfortunately, even with the Spark dependencies marked as Provided, my JAR went from 324K to 12M, so I opted to use println() instead. Here is my commit message:

log using println

I went with the println option because it keeps the size of the JAR small.

I trialled com.typesafe.scalalogging.Logger but my tests failed with the error:

java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger

because that library isn't provided with Spark. I attempted to use sbt assembly instead of sbt package, but this caused the size of the JAR to grow from 324K to 12M, even with the Spark dependencies set to Provided. A 12M JAR isn't worth the trade-off just to use scala-logging, hence using println instead.
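
For what it's worth, I suspect a good chunk of that 12M is the Scala standard library (which Spark already ships) plus ScalaTest, which I notice isn't scoped % "test" in my dependencies above. If I revisit assembly I could probably shrink the fat JAR by fixing the scalatest scope and telling sbt-assembly to leave the Scala library out, something like this (an untested sketch, assuming sbt-assembly 0.14.x is already in project/plugins.sbt):

// build.sbt: don't bundle the Scala standard library into the assembly JAR
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)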

I note that pasha701 suggested using log4j instead, as that is provided with Spark, so I shall try that next. Any advice on using log4j from Scala when writing a Spark library would be much appreciated.
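
Based on the log4j 1.2 API (which is what Spark 2.2 bundles), I believe the usage from Scala would look something like the sketch below, though I haven't tried it yet; MyJob here is just a made-up class name:

import org.apache.log4j.Logger

class MyJob {
  // @transient lazy val avoids serialization problems if this class ends up
  // inside a task closure, since org.apache.log4j.Logger is not serializable
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def run(): Unit = {
    log.info("stuff")
  }
}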

jamiet
  • "log4j" is used by Spark, and can be used in application. But right way is to set correct scope for dependencies, for ex. "provided" for Spark libs ("spark-core","spark-sql", etc), and use sbt "assembly" for generate fat jar with all dependencies. – pasha701 Jun 16 '20 at 16:52
  • Thanks. Could you elaborate on “set correct scope for dependencies”? I’m not sure what is meant by that. – jamiet Jun 16 '20 at 21:47
  • Ah, googling led me to https://stackoverflow.com/a/31471907 which I think answers my previous question. Am I correct? – jamiet Jun 16 '20 at 21:54
  • yes, correct. PS: for Typesafe logging, traits "StrictLogging" or "LazyLogging" are better choice than creating logger in each class. – pasha701 Jun 17 '20 at 07:22
  • @pasha701 I tried setting scope as Provided but it still caused my JAR to grow massively in size so wasn't worth the trade-off. I have updated the question accordingly. Any advice on using log4j would be much appreciated. – jamiet Jun 19 '20 at 07:45
  • Log4j manual: https://logging.apache.org/log4j/1.2/manual.html – pasha701 Jun 19 '20 at 17:29
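
For reference, the StrictLogging/LazyLogging traits pasha701 mentions above simply mix a logger field into the class instead of constructing one by hand; a minimal sketch (only useful once scala-logging is actually on the runtime classpath, and MyJob is again a made-up name):

import com.typesafe.scalalogging.LazyLogging

class MyJob extends LazyLogging {
  def run(): Unit = {
    logger.info("stuff")   // `logger` is supplied by the LazyLogging trait
  }
}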

1 Answer


As you said, sbt assembly will include all the dependencies in your jar.

If you only want certain dependencies available at runtime, you have two options:

  1. Download the logback-core and logback-classic JARs and pass them to spark2-submit with the --jars option
  2. Specify the above dependencies as Maven coordinates (e.g. ch.qos.logback:logback-classic:1.2.3) in the --packages option of spark2-submit
ShemTov