The following snippet tries to apply a simple function to a PySpark RDD object:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.dynamicAllocation.minExecutors', 5)
sc = pyspark.SparkContext(appName="tmp", conf=conf)
sc.setLogLevel('WARN')
fn = 'my_csv_file'
rdd = sc.textFile(fn)
rdd = rdd.map(lambda line: line.split(","))
header = rdd.first()
rdd = rdd.filter(lambda line:line != header)
def parse_line(line):
    ret = pyspark.Row(**{h: line[i] for (i, h) in enumerate(header)})
    return ret
rows = rdd.map(lambda line: parse_line(line))
sdf = rows.toDF()
If I start the program with python my_snippet.py, it fails with:
File "<ipython-input-27-8e46d56b2984>", line 6, in <lambda>
File "<ipython-input-27-8e46d56b2984>", line 3, in parse_line
NameError: global name 'pyspark' is not defined
I replaced the parse_line function with the following:
def parse_line(line):
    ret = {h: line[i] for (i, h) in enumerate(header)}
    ret['dir'] = dir()
    return ret
Now the dataframe is created, and the dir column shows that the namespace inside the function contains only two objects: line and ret. How can I make other modules and objects part of the function's namespace? Not only pyspark but others too (see the sketch below).
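To make that concrete, this is the kind of mapped function I would like to be able to write, continuing from the snippet above (json here is only a stand-in for any module imported at the top of the script, and to_json is a made-up name):
import json  # imported at the top of the script, like pyspark

def to_json(line):
    # I want module-level names such as json (or pyspark) to resolve
    # here when the function is executed by rdd.map(...)
    return json.dumps(dict(zip(header, line)))

json_rdd = rdd.map(to_json)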
EDIT: Note that pyspark is available within the program. It is only when the function is called by map (and, I assume, filter, reduce and the others) that it does not see any imported modules.
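To spell out what I mean, something like the following (still using the rdd from the snippet above; repr is only there to force the name lookup) is the behaviour I am describing:
print(repr(pyspark))                             # on the driver: works
print(rdd.map(lambda _: repr(pyspark)).first())  # inside map: NameError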