How to put use a map/lambda inside of a map/lambda in pyspark?

Question

I have a list of lists like this:

b = [['r','w'],['n','finished']]

I would like to be able to operate on each element within each list.

I can do this locally in python:

result = b.map(lambda aList: \
           map(lambda aString: \
                              '' if aString.strip().lower() in [' finish', 'finished', 'terminate', 'done'] else aString,\
                              aList))

But, Spark has trouble serializing the inner map:

 File "/<path>/python/pyspark/worker.py", line 88, in main
12/11/2015 18:24:49 [launcher]      command = pickleSer._read_with_length(infile)
12/11/2015 18:24:49 [launcher]    File "//<path>/spark/python/pyspark/serializers.py", line 156, in _read_with_length
12/11/2015 18:24:49 [launcher]      return self.loads(obj)
12/11/2015 18:24:49 [launcher]    File "//<path>//python/pyspark/serializers.py", line 405, in loads
12/11/2015 18:24:49 [launcher]      return cPickle.loads(obj)
12/11/2015 18:24:49 [launcher]  AttributeError: 'module' object has no attribute 'map'

How do I work around this to either, use an inner map or accomplish the same thing?

zero323 · Answer 1 · 2015-12-12T03:45:40.940

3

One way to handle this:

to_replace = ['finish', 'finished', 'terminate', 'done'] 

rdd = sc.parallelize([['r','w'],['n','finished']])
rdd.map(lambda xs: ['' if  x.strip().lower() in to_replace else x for x in xs])

Generally speaking if you find yourself thinking about nested functions it is a good sign you should use a normal function not a lambda expression.

edited Dec 12 '15 at 03:45

answered Dec 12 '15 at 03:30

zero323

322,348
103
959
935

by "normal function" you mean `built-ins`? – makansij Dec 12 '15 at 05:20
1

Defining a function and use it in map and not using lambda expression. like `def myfunc(x):` and `rdd.map(myfunc)` – WoodChopper Dec 12 '15 at 08:42

score 2 · Answer 2 · edited May 23 '17 at 12:30

Or alternatively and using @zero323's template, if you are using Python 2.x you can use a map instead of a for but this is a python issue not a pyspark one, and the effect is the same.

to_replace = ['finish', 'finished', 'terminate', 'done'] 

rdd = sc.parallelize([['r','w'],['n','finished']])
rdd.map(
    lambda xs: map(lambda x: "" if x.strip().lower() in to_replace else x, xs)
)

But, if to_replace list is really big, you should use a broadcast variable.

How to put use a map/lambda inside of a map/lambda in pyspark?

2 Answers2