I am learning Apache Spark. After carefully reading the Spark tutorials, I understand how to pass a plain Python function to Spark to transform an RDD dataset. But I still have no idea how Spark works with methods defined inside a class.
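To show what I mean by the plain-function case, here is a minimal sketch of the kind of map I do understand (the function square and the app name "plain-function-test" are just names I made up for this illustration):

from pyspark import SparkConf, SparkContext

def square(x):
    # A plain top-level function can be passed to Spark's map directly.
    return x * x

conf = SparkConf().setAppName("plain-function-test").setMaster("local[2]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(square).collect())   # [1, 4, 9, 16, 25]
sc.stop()

Now here is my code, where I try to do the same thing with classes and methods: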
import numpy as np
import copy
from pyspark import SparkConf, SparkContext

class A():
    def __init__(self, n):
        self.num = n

class B(A):
    ### Copy the item of class A into B.
    def __init__(self, A):
        self.num = copy.deepcopy(A.num)

    ### Print out the item of B.
    def display(self, s):
        print(s.num)
        return s

def main():
    ### Locally run an application "test" using Spark.
    conf = SparkConf().setAppName("test").setMaster("local[2]")
    ### Set up the Spark context from the configuration.
    sc = SparkContext(conf=conf)

    ### "data" is a list of instances of class A.
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    ### "lines" distributes "data" as an RDD in Spark.
    lines = sc.parallelize(data)

    ### Create the instances of class B in parallel using
    ### Spark "map".
    temp = lines.map(B)

    ### I get the error when the following line runs:
    ### NameError: global name 'display' is not defined.
    temp1 = temp.map(display)

if __name__ == "__main__":
    main()
In fact, I used the code above to generate a list of instances of class B in parallel with temp = lines.map(B). After that, I called temp1 = temp.map(display), because I wanted to print each item in that list of class B instances in parallel. But now the error shows up: NameError: global name 'display' is not defined.
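My best guess, which I have not tested, is that I need to call the method through each instance, for example with a lambda like the sketch below, but I am not sure whether this is the right way to do it in Spark (or whether the prints would even show up on the driver rather than on the workers):

# Untested guess: call display through each B instance via a lambda.
# Since display takes both self and s, I pass the same object for both.
temp1 = temp.map(lambda b: b.display(b))
temp1.collect()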
I am wondering how I can fix this error properly while still using Spark's parallel computing. I would really appreciate any help.