I'm looking for some help with how object/singleton vals are initialized when using the spark-shell.
Test code:
import org.apache.spark.sql.SparkSession
object Consts {
  val f2 = 2
}
object Test extends Serializable {
  val f1 = 1
  println(s"-- init Test singleton f1=${f1} f2=${Consts.f2}")
  def doWorkWithF1(x: Int) = {
    f1
  }
  def doPartitionWorkWithF1(partitionId: Int, iter: Iterator[Int]) = {
    iter.map(x => f1)
  }
  def doPartitionWorkWithF2(partitionId: Int, iter: Iterator[Int]) = {
    iter.map(x => Consts.f2)
  }
  def main(args: Array[String]) {
    println(s"-- main starting f1=${f1} f2=${Consts.f2}")
    val spark = SparkSession.builder().getOrCreate()
    val rdd = spark.sparkContext.parallelize(List(1,2,3,4))
    rdd.map(doWorkWithF1).foreach(print)
    rdd.mapPartitionsWithIndex(doPartitionWorkWithF1).foreach(print)
    rdd.mapPartitionsWithIndex(doPartitionWorkWithF2).foreach(print)
  }
}
Running:
$ spark-shell --master local[4]
scala> :paste "test.scala"
...
defined object Consts
defined object Test
scala> Test.main(Array())
-- init Test singleton f1=1 f2=2
-- main starting f1=1 f2=2
11110000
23/02/22 21:03:31 ERROR executor.Executor: Exception in task 1.0 in stage 2.0 (TID 9)
java.lang.NullPointerException
at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Test$$anonfun$doPartitionWorkWithF2$1.apply$mcII$sp(test.scala:37)
at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Test$$anonfun$doPartitionWorkWithF2$1.apply(test.scala:37)
at $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Test$$anonfun$doPartitionWorkWithF2$1.apply(test.scala:37)
...
- doWorkWithF1 (using map) works as I expect: the output is 1111.
- In doPartitionWorkWithF1, the output is not what I expect: it is 0000. Why is the val f1 seen as 0 and not 1? Asked another way: when are Int vals in object singletons left at their default value of 0? (See the sketch after this list.)
- In doPartitionWorkWithF2, I assume the NullPointerException is because f2 is null. Why is that? Asked another way: when are vals in object singletons left as null?
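To make the question concrete, here is a minimal non-Spark sketch of what I think I'm seeing: if a field of an object is read before that object's initializer has run, you get the JVM default (0 for an Int, null for a reference). The InitOrder object and its fields are made up purely for illustration; I don't know whether this is actually the mechanism inside the shell.

object InitOrder {
  // show() is called while the object is still being constructed,
  // so it reads f1 and s before their initializers have executed
  val observed = show()
  val f1 = 1
  val s = "hello"
  def show(): String = s"during init: f1=$f1 s=$s"
}

// In the REPL: InitOrder.observed evaluates to "during init: f1=0 s=null"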
Adding lazy on line 6, i.e. lazy val f1 = 1, makes doPartitionWorkWithF1 work as I desire (expected): 1111 is the result in the spark-shell.
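For reference, this is the shape of the workaround; only the f1 definition changes and the rest of Test stays as above. The comment reflects my understanding of why lazy helps (initialization is deferred until first access), not a confirmed explanation of the shell's behavior.

object Test extends Serializable {
  lazy val f1 = 1  // initialized on first access instead of at object construction
  def doPartitionWorkWithF1(partitionId: Int, iter: Iterator[Int]) =
    iter.map(x => f1)
  // ...rest unchanged...
}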
And this is where Spark gets frustrating to work with: if the original version (without lazy) is compiled and run using spark-submit, I get the desired/expected result:
$ /usr/bin/spark-submit --master local[4] --driver-memory 1024m --name "TEST" --class Test test.jar 2> err
-- init Test singleton f1=1 f2=2
-- main starting f1=1 f2=2
111111112222
I really don't like having to write code differently just so it works in the spark-shell, but since the shell is so convenient, I do it anyway. These kinds of nuances cost me a lot of time and effort, though. The above is the salient part of a 2000-line program, and it took me hours to figure out where in the code the shell was doing something different than the compiled version.