I have a dataframe with a number of rows. I can loop through it on the driver using this code:
for row in df.rdd.collect():
    # ... process each row sequentially on the driver
But this won't run in parallel, right? What I want instead is to map over each row, pass it to a UDF, and get back a new dataframe (read from a DB) based on the values in that row.
I tried:

df.rdd.map(lambda row: read_from_mongo(row, spark)).toDF()
But I got this error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
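For context, read_from_mongo looks roughly like this. This is a simplified sketch: the connector format string, collection name, and filter column below are placeholders for the real ones.

from pyspark.sql.functions import col

def read_from_mongo(row, spark):
    # Roughly what my function does: read a MongoDB collection into a
    # dataframe and filter it by a value taken from the incoming row.
    # ("mongodb", "my_collection", and "some_field" are placeholders.)
    return (spark.read
            .format("mongodb")
            .option("collection", "my_collection")
            .load()
            .filter(col("some_field") == row["some_field"]))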
How do I process the dataframe's rows in parallel and keep the dataframe returned for each row?