I've been looking at the documentation for Spark and it mentions this:
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:

- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions { def func1(s: String): String = { ... } }
myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:
class MyClass {
def func1(s: String): String = { ... }
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map inside there references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. It is similar to writing
rdd.map(x => this.func1(x)).
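To make the quoted passage concrete: the workaround the Spark docs suggest for this situation (at least for fields) is to copy what you need into a local variable, so the closure captures only that variable and not `this`. A sketch, with an example method body of my own since the docs elide it:

```scala
import org.apache.spark.rdd.RDD

class MyClass {
  val suffix: String = "!"  // example field, not from the docs

  def doStuff(rdd: RDD[String]): RDD[String] = {
    // Copying the field into a local val means the closure below captures
    // only `suffix_`, so the whole MyClass instance is NOT shipped to the cluster.
    val suffix_ = this.suffix
    rdd.map(x => x + suffix_)
  }
}
```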
Now my question is what happens if the singleton object has attributes (which are supposed to be equivalent to static fields). Same example with a small alteration:
object MyClass {
val value = 1
def func1(s: String): String = { s + value }
}
myRdd.map(MyClass.func1)
So the function is still referenced statically, but how far does Spark go when trying to serialize all referenced variables? Will it serialize value, or will value be initialized again on the remote workers?
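One experiment I had in mind to check this (a hypothetical sketch; it assumes a running cluster and an existing myRdd): put a side effect in the object's initializer and see where it fires.

```scala
object MyClass {
  // If the singleton is re-initialized in each executor JVM rather than
  // serialized from the driver, this line shows up in the executor stdout
  // logs, not (only) on the driver.
  println(s"MyClass initialized in JVM: " +
    java.lang.management.ManagementFactory.getRuntimeMXBean.getName)

  val value = 1
  def func1(s: String): String = s + value
}

// Trigger the job, then inspect the executor logs in the Spark UI:
// myRdd.map(MyClass.func1).collect()
```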
Additionally, all of this is in the context that I have some heavy models inside a singleton object. I would like to find the correct way to serialize (or re-create) them on the workers while keeping the ability to reference them from the singleton everywhere, instead of threading them through a fairly deep function call stack as parameters.
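To make the heavy-model case concrete, the kind of pattern I have in mind is a lazily initialized singleton, which (if singletons are indeed re-initialized per executor JVM) each executor would build locally on first use. A hypothetical sketch; HeavyModel and its constructor path are placeholders for my real model:

```scala
// Placeholder for the real heavy model type.
class HeavyModel(path: String) {
  def predict(s: String): String = s  // stand-in logic
}

object Models {
  // `lazy val` defers construction until first access; nothing would need
  // to be serialized from the driver, each JVM loads its own copy once.
  lazy val model: HeavyModel = new HeavyModel("/path/to/model")
}

// Deep inside the call stack, code could then just reference Models.model:
// rdd.map(s => Models.model.predict(s))
```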
Any in-depth information on what/how/when Spark serializes things would be appreciated.