I have a task that needs to be run on the result of a DataFrame aggregation. In pseudocode, it looks something like this:
from pyspark.sql import functions as F

def perform_task(param1, param2):
    # Count rows per param2 group in each DataFrame, filtered to this param1 value
    counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
    counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
    # Collect both (small) count tables to the driver and run the algorithm on them
    return algorithmResult(counts1.toPandas(), counts2.toPandas())

for param_set in all_params:
    print(perform_task(*param_set))
How can the above code be properly parallelized in Spark? Turning all_params into a parallelized collection and doing a .map() over it won't work, since I'd be accessing DataFrames inside the map function - so what's the "proper" way of doing this?
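To make that attempt concrete, here is a minimal sketch of what I mean (assuming a SparkSession named spark; all_params and perform_task are as in the pseudocode above):

# Sketch of the approach that doesn't work: perform_task references
# someDF1/someDF2, and DataFrames (and the SparkSession) can't be used
# inside functions shipped to executors via RDD transformations.
params_rdd = spark.sparkContext.parallelize(all_params)
results = params_rdd.map(lambda param_set: perform_task(*param_set)).collect()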
I'm rather new to Spark, so any suggestions are welcome. Thanks!