Could someone please help me understand the behaviour of chaining map functions onto an RDD inside a Python for loop?
For the following code:
rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

for i in range(3):
    rdd = rdd.map(lambda x: appender(x, i))

rdd.collect()
I get the output:
[[1, 2, 2, 2], [2, 2, 2, 2], [3, 2, 2, 2]]
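The repeated 2s look like the final value of i after the loop has finished, which makes me suspect Python's late-binding closures. A minimal plain-Python sketch (no Spark involved, just my guess at the mechanism) reproduces the same pattern:

fns = []
for i in range(3):
    # every lambda refers to the same variable i, not the value it had
    # at the iteration in which the lambda was created
    fns.append(lambda x: x + [i])

print([f([1]) for f in fns])  # [[1, 2], [1, 2], [1, 2]] -- all see the final i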
Whereas with the following code:
rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

rdd = rdd.map(lambda x: appender(x, 1))
rdd = rdd.map(lambda x: appender(x, 2))
rdd = rdd.map(lambda x: appender(x, 3))

rdd.collect()
I get the expected output:
[[1, 1, 2, 3], [2, 1, 2, 3], [3, 1, 2, 3]]
I imagine this has something to do with how the closure is serialized and shipped to the executors by PySpark, but I can't find any documentation about this...
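If late binding really is the cause, I would expect that binding i at lambda-creation time makes the loop version behave. This is only a sketch of the usual default-argument workaround (the i=i part is my assumption, not something from the Spark docs), reusing the same spark session as above:

rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

for i in range(3):
    # capture the current value of i as a default argument so each
    # lambda carries its own copy instead of the shared loop variable
    rdd = rdd.map(lambda x, i=i: appender(x, i))

rdd.collect()

I would expect this to append 0, 1, 2 (the successive values of range(3)) to each list, analogous to the explicitly chained version. Is that the right mental model, or does PySpark's closure serialization behave differently?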