How to perform one operation on each executor once in spark

Question

I have a weka model stored in S3 which is of size around 400MB. Now, I have some set of record on which I want to run the model and perform prediction.

For performing prediction, What I have tried is,

Download and load the model on driver as a static object , broadcast it to all executors. Perform a map operation on prediction RDD. ----> Not working, as in Weka for performing prediction, model object needs to be modified and broadcast require a read-only copy.
Download and load the model on driver as a static object and send it to executor in each map operation. -----> Working (Not efficient, as in each map operation, i am passing 400MB object)
Download the model on driver and load it on each executor and cache it there. (Don't know how to do that)

Does someone have any idea how can I load the model on each executor once and cache it so that for other records I don't load it again?

Starting with Spark 2.4 there's a Java interface `ExecutorPlugin` that allows custom `init()` and `shutdown()` > https://issues.apache.org/jira/browse/SPARK-24918 — Samson Scharfrichter, Feb 18 '20 at 14:31

Dia Kharrat · Answer 1 · 2017-11-15T17:24:01.123

40

You have two options:

1. Create a singleton object with a lazy val representing the data:

    object WekaModel {
        lazy val data = {
            // initialize data here. This will only happen once per JVM process
        }
    }

Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes their own instance of the data. No serialization or broadcasts will be performed for data.

    elementsRDD.map { element =>
        // use WekaModel.data here
    }

Advantages

is more efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when needing to initialize a database connection pool for example.

Disadvantages

Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.

2. Use the `mapPartition` (or `foreachPartition`) method on the RDD instead of just `map`.

This allows you to initialize whatever you need for the entire partition.

    elementsRDD.mapPartition { elements =>
        val model = new WekaModel()

        elements.map { element =>
            // use model and element. there is a single instance of model per partition.
        }
    }

Advantages:

Provides more flexibility in the initialization and deinitialization of objects.

Disadvantages

Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.

edited Nov 15 '17 at 17:24

answered Oct 14 '16 at 02:46

Dia Kharrat

5,948
3
32
43

2

Are you sure about #1? I am getting serialization errors with it. Also, what do you do if the data initialization depends on runtime parameters? – Frank Apr 20 '17 at 00:54
There shouldn't be any serialization happening with approach #1. If there is, it's likely that you're referencing an intermediate object in your RDD methods. Regarding your question about initialization, it is indeed more difficult to control. Your runtime parameters would need to also be available statically (e.g. via System properties or configuration files). Singleton initialization isn't Spark-specific though; it's a Scala topic. – Dia Kharrat Apr 21 '17 at 04:14
4

Can you do the same in Java? – Ross Brigoli Oct 04 '17 at 08:50
3

With respect to this disadvantage, `Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.`. This is exactly what I am trying to achieve. Do you have any examples or have you seen this done? I am making a call to an external system to get the database connection configs. So ideally, I do not want to call the external system on each executor. I just asked this question which is very similar. https://stackoverflow.com/questions/47241882/spark-object-singleton-serialization-on-executors – Alex Naspo Nov 11 '17 at 19:36
One possible way to initialize your DB connection is via System properties (e.g. `System.getProperty("db.host")`) – Dia Kharrat Nov 12 '17 at 05:59
@DiaKharrat can you please provide a solutin in python? if a want to load a huge file (8gb) pretrained embedding file on each executor once, how we can do it? – bib Feb 13 '19 at 06:39

score 2 · Answer 2 · answered May 18 '17 at 06:24

Here's what worked for me even better than the lazy initializer. I created an object level pointer initialized to null, and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables but not the Object-level ones.

object Thing1 {
  var bigObject : BigObject = null

  def main(args: Array[String]) : Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
         // this takes a minute but runs just once
         bigObject = new BigObject(parameters)  
      }
      bigObject.transform(line)
    })
  }
}

This approach creates exactly one big object per executor, rather than the one big object per partition of other approaches.

If you put the var bigObject : BigObject = null within the main function namespace, it behaves differently. In that case, it runs the bigObject constructor at the beginning of each partition (ie. batch). If you have a memory leak, then this will eventually kill the executor. Garbage collection would also need to do more work.

this will call `new BigObject` multiple times if your `spark.executor.cores` is greater than 1. the lazy approach protects against concurrent initialisation. — Dan, Jul 16 '17 at 16:32
@Dale, you code is technically not thread-safe, since multiple executors can initialize your global object if they run concurrently. — Dia Kharrat, Nov 15 '17 at 17:27

score 1 · Answer 3 · answered Dec 20 '19 at 22:59

Here is what we usually do

define a singleton client that do those kind of stuff to ensure only one client is present in each executors
have a getorcreate method to create or fetch the client information, usulaly let's you have a common serving platform you want to serve for multiple different models, then we can use like concurrentmap to ensure threadsafe and computeifabsent
the getorcreate method will be called inside RDD level like transform or foreachpartition, so make sure init happen in executor level

score 1 · Answer 4 · answered Nov 14 '20 at 16:51

You can achieve this by broadcasting a case object with a lazy val as follows:

case object localSlowTwo {lazy val value: Int = {Thread.sleep(1000); 2}}
val broadcastSlowTwo = sc.broadcast(localSlowTwo)
(1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect

The event timeline for this on three executors with three threads each looks as follows:

Running the last line again from the same spark-shell session does not initialize any more:

score 0 · Answer 5 · answered Feb 06 '23 at 11:23

This works for me and it's threadsafe if you use singleton and synchronized like shown below

object singletonObj {
  var data: dataObj =null
  def getDataObj(): dataObj = this.synchronized {
     if (this.data==null){
        this.data = new dataObj()
     }
     this.data
  }
}

object app {
   def main(args: Array[String]): Unit = {
      lazy val mydata: dataObj = singletonObj.getDataObj()
      df.map(x=>{ functionA(mydata) })
   }
}

How to perform one operation on each executor once in spark

5 Answers5

1. Create a singleton object with a lazy val representing the data:

2. Use the `mapPartition` (or `foreachPartition`) method on the RDD instead of just `map`.

Linked

How to perform one operation on each executor once in spark

5 Answers5

1. Create a singleton object with a lazy val representing the data:

2. Use the mapPartition (or foreachPartition) method on the RDD instead of just map.

Linked

2. Use the `mapPartition` (or `foreachPartition`) method on the RDD instead of just `map`.