I have a dataframe with a number of rows. I can loop through it on the driver using this code:
for row in df.rdd.collect():
    # ... process each row sequentially on the driver
But this won't run in parallel, right? What I want instead is to map over each row, pass it to a UDF, and get back a new dataframe (read from a DB) based on the values in that row.
I tried:

df.rdd.map(lambda row: read_from_mongo(row, spark)).toDF()
But I got this error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
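For context, read_from_mongo looks roughly like this. This is a simplified sketch: the connector format string, collection name, and filter column below are placeholders for the real ones.

from pyspark.sql.functions import col

def read_from_mongo(row, spark):
    # Roughly what my function does: read a MongoDB collection into a
    # dataframe and filter it by a value taken from the incoming row.
    # ("mongodb", "my_collection", and "some_field" are placeholders.)
    return (spark.read
            .format("mongodb")
            .option("collection", "my_collection")
            .load()
            .filter(col("some_field") == row["some_field"]))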
How do I process the dataframe's rows in parallel and keep the dataframe returned for each row?