The following snippet tries to apply a simple function to a PySpark RDD object:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.dynamicAllocation.minExecutors', 5)
sc = pyspark.SparkContext(appName="tmp", conf=conf)
sc.setLogLevel('WARN')
fn = 'my_csv_file'
rdd = sc.textFile(fn)
rdd = rdd.map(lambda line: line.split(","))
header = rdd.first()
rdd = rdd.filter(lambda line:line != header)
def parse_line(line):
    ret = pyspark.Row(**{h: line[i] for (i, h) in enumerate(header)})
    return ret
rows = rdd.map(lambda line: parse_line(line))
sdf = rows.toDF()
If I start the program with python my_snippet.py, it fails with:
File "<ipython-input-27-8e46d56b2984>", line 6, in <lambda>
File "<ipython-input-27-8e46d56b2984>", line 3, in parse_line
NameError: global name 'pyspark' is not defined
I replaced the parse_line function with the following:
def parse_line(line):
    ret = {h: line[i] for (i, h) in enumerate(header)}
    ret['dir'] = dir()
    return ret
Now the dataframe is created, and the dir column shows that the namespace inside the function contains only two objects: line and ret. How can I make other modules and objects part of the function's namespace? Not only pyspark but others too (see the sketch below).
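To make that concrete, this is the kind of mapped function I would like to be able to write, continuing from the snippet above (json here is only a stand-in for any module imported at the top of the script, and to_json is a made-up name):
import json  # imported at the top of the script, like pyspark

def to_json(line):
    # I want module-level names such as json (or pyspark) to resolve
    # here when the function is executed by rdd.map(...)
    return json.dumps(dict(zip(header, line)))

json_rdd = rdd.map(to_json)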
EDIT: Note that pyspark is available within the program. It is only when the function is called by map (and, I assume, filter, reduce and the others) that it does not see any imported modules.
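To spell out what I mean, something like the following (still using the rdd from the snippet above; repr is only there to force the name lookup) is the behaviour I am describing:
print(repr(pyspark))                             # on the driver: works
print(rdd.map(lambda _: repr(pyspark)).first())  # inside map: NameError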