The Spark docs say:
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
If I create a Java SimpleDateFormat and use it in RDD operations, I get a NumberFormatException: multiple points.
I know SimpleDateFormat is not thread-safe. But according to the Spark docs, the SimpleDateFormat object is copied to each task, so no two threads should be accessing the same object.
I suspect that all tasks in one executor share the same SimpleDateFormat object. Am I right?
This program prints the same object, java.text.SimpleDateFormat@f82ede60, from every task:
import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}

object NormalVariable {
  // creating dateFormat here instead doesn't change the behavior
  // val dateFormat = new SimpleDateFormat("yyyy.MM.dd")

  def main(args: Array[String]) {
    val dateFormat = new SimpleDateFormat("yyyy.MM.dd")
    val conf = new SparkConf().setAppName("Spark Test").setMaster("local[*]")
    val spark = new SparkContext(conf)
    val dates = Array[String]("1999.09.09", "2000.09.09", "2001.09.09", "2002.09.09", "2003.09.09")
    println(dateFormat)
    val results = spark.parallelize(dates).map { i =>
      println(dateFormat)
      dateFormat.parse(i)
    }.collect()
    println(results.mkString(" "))
    spark.stop()
  }
}
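For what it's worth, the workaround I'm currently using is independent of Spark: give each thread its own formatter via ThreadLocal, so concurrent tasks never touch the same SimpleDateFormat instance. A minimal plain-Java sketch (the class name ThreadLocalFormat and the parse helper are my own, not Spark API):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadLocalFormat {
    // One SimpleDateFormat per thread: safe even when many tasks run concurrently
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy.MM.dd"));

    static Date parse(String s) throws ParseException {
        return FORMAT.get().parse(s);
    }

    public static void main(String[] args) throws Exception {
        String[] dates = {"1999.09.09", "2000.09.09", "2001.09.09"};
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (String d : dates) {
                    try {
                        parse(d); // each thread parses with its own formatter
                    } catch (ParseException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
        System.out.println("parsed " + dates.length + " dates from "
            + threads.length + " threads without error");
    }
}
```

Inside the Spark job, the equivalent would be constructing the SimpleDateFormat inside the map closure (or per partition with mapPartitions) instead of once in the driver.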