
The purpose of the code is to load a DataFrame and compute some logic with the "myFunc" method, operating on the RDD to get the parallelization benefit.

The following lines:

df_rdd = ParallelBuild().run().map(lambda line: line).persist()
r = df_rdd.map(ParallelBuild().myFunc)

gave me exit 0. Googling suggested that Spark uses lazy evaluation, so some action is needed to trigger the effect, and I added:

r.count()

which gives me:

TypeError: 'JavaPackage' object is not callable

A noticeable thing was that:

r = df_rdd.map(ParallelBuild().myFunc)

gives a "PipelinedRDD". Not sure what that is, but it looks like some transformation?

The interesting part was that when I removed the run method and implemented:

data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])

directly in my main function, then things worked just fine. But obviously I lose the modular part of my code.

Code:

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc
sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)


class ParallelBuild(object):

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
        return df


if __name__ == "__main__":

    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
nalin

1 Answer


Ok, so your main question is "Why isn't anything printing?" There are two parts to the answer.

  1. You can't really print in distributed computing. So your function myFunc won't print anything to the driver. The reason for this is rather complicated, so I direct you to this page for more information on why printing doesn't really work in Spark. (The sketch below shows one way to bring the values back to the driver and print them there.)

However, calling r.count() should print out 9. Why doesn't that work?

  2. Your function myFunc doesn't make much sense. When you call it in r = df_rdd.map(ParallelBuild().myFunc), you're passing in the df_rdd, I think. But this is already a DataFrame. Each row of this DataFrame is of type Row, and if you call df_rdd.first() you will get Row(uid=1, address_uid=u'a'). What you are doing in myFunc is an attempt to use split, but split is for string objects, and you have Row objects. I'm not sure why this isn't throwing an error, but you simply can't call split on a Row object. Consider something more along the lines of r = df_rdd.map(lambda x: x[0]) (see the sketch below).

So, I think r.count() doesn't work because something is getting muddled when you call myFunc.
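
To make both points concrete, here is a minimal sketch, assuming the sc and sqlContext set up at the top of the question. It goes through df.rdd so the same lines work whether or not your Spark version exposes map directly on a DataFrame, and it prints on the driver rather than inside the mapped function:

data = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd')]
df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])

df.rdd.first()                   # Row(uid=1, address_uid=u'a') -- a Row, not a string, so split() is the wrong tool
r = df.rdd.map(lambda x: x[0])   # positional access works on Row objects; x.uid or x['uid'] also work
r.count()                        # 4 here (9 with your full data set); count() is the action that runs the job
for uid in r.collect():          # collect() brings the values back to the driver...
    print(uid)                   # ...so this print actually shows up in your console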


Side note:

In df_rdd = ParallelBuild().run().map(lambda line: line).persist(), the .map(lambda line: line) doesn't do anything: you're not making any changes to line, so don't run a map job. Have the code be df_rdd = ParallelBuild().run().persist()
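
Putting the side note together with the points above, the main block could look something like the following. This is just a sketch, not a tested drop-in: the extra .rdd is my own hedge in case DataFrame has no map method in your Spark version, and the answer's simpler ParallelBuild().run().persist() works too where it does.

if __name__ == "__main__":
    df_rdd = ParallelBuild().run().rdd.persist()   # no identity map step
    r = df_rdd.map(lambda x: x[0])                 # pull uid straight off each Row
    print(r.count())                               # 9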

Katya Willard