70

I am looking for a way to log additional data while code is executing on Apache Spark nodes, so that issues that appear during execution can be investigated later. Using a traditional solution such as com.typesafe.scalalogging.LazyLogging fails because the logger instance cannot be serialized in a distributed environment like Apache Spark.

I've investigated this problem and, for now, the solution I found is to use the org.apache.spark.Logging trait like this:

class SparkExample extends Logging {
  val someRDD = ...
  someRDD.map {
    rddElement => logInfo(s"$rddElement will be processed.")
    doSomething(rddElement)
  }
}

However, the Logging trait does not look like a permanent solution, because it is marked as @DeveloperApi and the class documentation mentions:

This will likely be changed or removed in future releases.

I am wondering: is there any known logging solution I can use that will allow me to log data while the RDDs are being executed on Apache Spark nodes?

Later edit: Some of the comments below suggest using Log4j. I've tried it, but I'm still having issues when using the logger from a Scala class (rather than a Scala object). Here is my full code:

import org.apache.log4j.Logger
import org.apache.spark._

object Main {
  def main(args: Array[String]) {
    new LoggingTestWithRDD().doTest()
  }
}

class LoggingTestWithRDD extends Serializable {

  val log = Logger.getLogger(getClass.getName)

  def doTest(): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("LogTest")
    val spark = new SparkContext(conf)

    val someRdd = spark.parallelize(List(1, 2, 3))
    someRdd.map { element =>
      log.info(s"$element will be processed")
      element + 1
    }
    spark.stop()
  }
}

The exception that I'm seeing is:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable -> Caused by: java.io.NotSerializableException: org.apache.log4j.Logger

Bogdan N
  • Well... Configure your log4j and get your logging done. – sarveshseri Mar 23 '15 at 11:40
  • So basically Apache Spark forces you to use only log4j? – Bogdan N Mar 23 '15 at 14:42
  • Fwiw, the correct spelling of log4j is "slf4j". – michael Mar 23 '15 at 17:11
  • In addition to / as an alternative to logging, metrics may give you what you want: http://spark.apache.org/docs/latest/monitoring.html – michael Mar 23 '15 at 17:19
  • @michael_n That's not correct. log4j and slf4j are different things. – ben_frankly Aug 05 '15 at 23:21
  • @ben_frankly you misunderstood the joke. I'm well aware of slf4j/log4j, but many are/were (justifiably) confused about their roles (and perhaps still are). Log4j is an API *and* an implementation; slf4j is an API. When selecting a logging API, people *should* choose slf4j. This does not preclude using log4j as the implementation. (Anywhere "log4j" appears in code should be "spell checked" to be slf4j :-)) – michael Aug 07 '15 at 06:38
  • If you want something guaranteed not to change, and think it's worth the effort, write it yourself. Possibly using Akka. But I don't think that is worthwhile - just change the code if Spark forces it. – BAR Oct 04 '15 at 20:30
  • There is no reason for using slf4j in an application, only in a library. BTW, I would recommend using Log4j 2.x. – Mikael Ståldal Jul 08 '16 at 13:31
  • If you use Log4j 2.x, the example should work since the Logger in Log4j 2.x is Serializable. – Mikael Ståldal Jul 08 '16 at 13:51
  • Can't you create the logger inside an `rdd.foreachPartition` block? That avoids serializing the logger across worker nodes; each worker gets its own logger. – stanislav.chetvertkov Nov 24 '16 at 10:54
  • My solution prevents `TaskNotSerializable` – ragazzojp Jun 27 '22 at 20:14

7 Answers

54

You can use the solution Akhil proposed in
https://www.mail-archive.com/user@spark.apache.org/msg29010.html. I have used it myself and it works.

Akhil Das Mon, 25 May 2015 08:20:40 -0700
Try this way:

object Holder extends Serializable {
   @transient lazy val log = Logger.getLogger(getClass.getName)
}


val someRdd = spark.parallelize(List(1, 2, 3))
someRdd.foreach { element =>
   Holder.log.info(element)
}
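Why this works: @transient keeps the Logger out of anything Spark serializes, and lazy re-creates it on first use inside each executor's JVM, so every executor ends up with its own Logger instance. A minimal self-contained sketch of the pattern applied to a class like the one in the question (names mirror the question's code and are otherwise illustrative):

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

class LoggingTestWithRDD extends Serializable {
  def doTest(): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("LogTest")
    val spark = new SparkContext(conf)

    spark.parallelize(List(1, 2, 3)).foreach { element =>
      // Runs on the executors; each one lazily creates its own Logger.
      Holder.log.info(s"$element will be processed")
    }
    spark.stop()
  }
}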
BAR
florins
11

Use Log4j 2.x. The core logger has been made serializable. Problem solved.

Jira discussion: https://issues.apache.org/jira/browse/LOG4J2-801

"org.apache.logging.log4j" % "log4j-api" % "2.x.x"

"org.apache.logging.log4j" % "log4j-core" % "2.x.x"

"org.apache.logging.log4j" %% "log4j-api-scala" % "2.x.x"
Ram Ghadiyaram
Ryan Stack
  • Can you please give a complete implementation of this logging, like how you create log4j2.properties and how it is implemented in code? – jAi Aug 21 '19 at 06:35
4

If you need some code to be executed before and after a map, filter or other RDD function, try using mapPartitions, where the underlying iterator is passed explicitly.

Example:

val log = ??? // this gets captured and produces serialization error
rdd.map { x =>
  log.info(x)
  x+1
}

Becomes:

rdd.mapPartitions { it =>
  val log = ??? // this is freshly initialized in worker nodes
  it.map { x =>
    log.info(x)
    x + 1
  }
}

Every basic RDD function is implemented in terms of mapPartitions.

Make sure to handle the partitioner explicitly and not lose it: see the Scaladoc for the preservesPartitioning parameter; this is critical for performance.
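For instance, a sketch with a Log4j logger created once per partition (the logger name and the increment are just for illustration):

import org.apache.log4j.LogManager
import org.apache.spark.rdd.RDD

object PartitionLogging {
  def addOneWithLogging(rdd: RDD[Int]): RDD[Int] =
    rdd.mapPartitions({ it =>
      // Created on the worker, once per partition, so nothing needs to be serialized.
      val log = LogManager.getLogger("executor-log")
      it.map { x =>
        log.info(s"processing $x")
        x + 1
      }
    }, preservesPartitioning = true) // only meaningful/safe when keys (if any) are left unchanged
}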

ragazzojp
2

This is an old post, but I want to share the working solution I arrived at after a lot of struggling; it may still be useful for others:

I wanted to log RDD contents inside an rdd.map function but kept getting the Task not serializable error. My solution uses a Scala object that extends java.io.Serializable:

import org.apache.log4j.{Level, LogManager}
import org.apache.spark.{SparkConf, SparkContext}

object MyClass extends Serializable {

  val log = LogManager.getLogger("name of my spark log")
  log.setLevel(Level.INFO)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("LogExample"))
    val rdd = sc.parallelize(List(1, 2, 3))

    rdd.map { t =>
      // Use the object's logger here; the object is not serialized with the closure
      val log = MyClass.log
      log.info("processing element " + t)
      t
    }.count() // an action is needed so the map (and the logging) actually runs

    sc.stop()
  }
}
Ram Ghadiyaram
khushbu kanojia
2

Making the logger transient and lazy does the trick

@transient lazy val log = Logger.getLogger(getClass.getName)

@transient tells Spark not to serialize it when shipping closures to the executors, and lazy causes the instance to be created when it is first used. In other words, each executor gets its own instance of the logger. Serializing the logger is not a good idea anyway, even if you could.
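A minimal sketch of this inside a class, in the spirit of the question's code (class and method names are illustrative):

import org.apache.log4j.Logger
import org.apache.spark.rdd.RDD

class Processor extends Serializable { // the class itself still has to be serializable
  // Not serialized with the closure; re-created lazily on each executor.
  @transient lazy val log = Logger.getLogger(getClass.getName)

  def process(rdd: RDD[Int]): RDD[Int] = rdd.map { x =>
    log.info(s"processing $x")
    x + 1
  }
}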

Of course, anything you put in the map() closure runs on the executors, so the output shows up in the executor logs rather than the driver logs. For custom log4j properties on the executors you need to add log4j.properties to the executor classpath and ship the file to the executors.

This can be done by adding the following arguments to your spark-submit command:

--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" --files ./log4j.properties

There are other ways to set these configs, but this one is the most common.
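For reference, a minimal log4j.properties along the lines of Spark's own conf/log4j.properties.template (a sketch; adjust appenders, levels and the pattern to your needs):

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n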

sparker
1
val log = Logger.getLogger(getClass.getName)

You can use log to write logs. If you need to change the logger properties, put a log4j.properties file in the /conf folder; by default there is a template in that location.

bummi
Karthik
  • I've tried to use log4j but I am still having serialization issues when calling the logger variable from a class (not from an object): `Exception in thread "main" org.apache.spark.SparkException: Task not serializable -> Caused by: java.io.NotSerializableException: org.apache.log4j.Logger` – Bogdan N Mar 24 '15 at 08:14
  • Simple solution: declare the log variable in local method scope. – nuaavee May 26 '15 at 17:50
  • What if you make "log" @transient? – Mikael Ståldal Jul 08 '16 at 13:34
  • mapPartitions comes to the rescue for such things: you can create the logger in the mapPartitions function and use it. The same technique is used for JDBC connections / MQ / Kafka producers. – Ashkrit Sharma Feb 16 '19 at 15:45
0

Here is my solution:

I am using SLF4J (with the Log4j binding); in the base class of every Spark job I have something like this:

import org.slf4j.LoggerFactory
val LOG = LoggerFactory.getLogger(getClass) 

Just before the place where I use LOG in distributed functional code, I copy the logger reference to a local constant:

val LOG = this.LOG

It worked for me!
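For context, a minimal sketch of this pattern (class and method names are illustrative): the local copy means the closure captures only the logger, not the enclosing class, and it relies on the SLF4J binding's logger being serializable (the Log4j binding's adapter re-resolves itself on deserialization).

import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory

abstract class SparkJobBase {
  val LOG = LoggerFactory.getLogger(getClass)

  def logElements(rdd: RDD[Int]): Unit = {
    // Local copy: the closure captures this val instead of `this`.
    val LOG = this.LOG
    rdd.foreach { x =>
      LOG.info(s"processing $x")
    }
  }
}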

Thamme Gowda