
I have a certain task that needs to be run on the result of a DataFrame aggregation. In pseudocode, it looks something like this:

from pyspark.sql import functions as F

def perform_task(param1, param2):
    # count rows per `param2` value in each DataFrame, restricted to this `param1`
    counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
    counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
    return algorithmResult(counts1.toPandas(), counts2.toPandas())

for param_set in all_params:
    print(perform_task(*param_set))

How can this code be properly parallelized in Spark? Turning param_set into a parallelized collection and doing a .map() won't work, since I'm accessing DataFrames inside the map function - so what's the "proper" way of doing this?

I'm rather new to Spark, so any suggestions are welcome. Thanks!

reflog
  • You can [use threading](https://stackoverflow.com/a/38049303/1560062), concurrent futures (I created [a proof of concept a while ago](https://github.com/zero323/pyspark-asyncactions)) and the `FAIR` scheduler / scheduling pools (a sketch of that approach follows these comments), though the whole pattern you've shown looks a bit sketchy. `filter` -> a single `groupBy` and switching to `RDD` could be a better choice. – zero323 Sep 22 '17 at 15:21
  • It seems `param2` is never used – MaFF Sep 22 '17 at 15:24
  • @zero323 - it feels sketchy to me too :) Can you suggest some pseudocode for how to handle such a scenario? Thanks! – reflog Sep 23 '17 at 09:47
  • `algorithmResult(counts1.toPandas(), counts2.toPandas())` is probably the biggest issue, but the first thing you can do is look at [ROLLUP](https://stackoverflow.com/q/37975227/1560062) to handle all `param2` options at the same time (grouping sets prune things further). Also, filtering for each level seems redundant. Just `groupBy("param1", param2)` and ignore the parts which are not necessary (see the second sketch after these comments). – zero323 Sep 23 '17 at 11:11
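
A minimal sketch of the threading / `FAIR` scheduler approach from the first comment, assuming a SparkSession named `spark` whose context was created with `spark.scheduler.mode=FAIR`, plus the `someDF1`, `someDF2`, `algorithmResult` and `all_params` names from the question; the pool name and worker count are arbitrary illustrations:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import functions as F

def perform_task(param1, param2):
    # Each call triggers its own Spark jobs; with FAIR scheduling enabled, jobs
    # submitted from different threads share executors instead of queueing strictly.
    counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
    counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
    return algorithmResult(counts1.toPandas(), counts2.toPandas())

# Optionally route these jobs into a dedicated scheduler pool (the name is just an example).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "param_jobs")

# Only the driver-side job submission is parallelized here; the actual
# aggregation work still runs on the executors.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda p: perform_task(*p), all_params))

for result in results:
    print(result)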
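
And a sketch of the single-aggregation idea from the last comment, assuming every param_set groups by the same column (called 'param2' here); if the grouping column varies per task, the linked ROLLUP / grouping-sets approach generalizes this:

# One pass over each DataFrame computes counts for every (param1, param2)
# combination, replacing the per-parameter filter + groupBy jobs.
counts1_all = someDF1.groupBy('param1', 'param2').count().toPandas()
counts2_all = someDF2.groupBy('param1', 'param2').count().toPandas()

for param1, _ in all_params:
    # Slice the precomputed pandas frames instead of launching new Spark jobs.
    c1 = counts1_all[counts1_all['param1'] == param1].drop(columns='param1')
    c2 = counts2_all[counts2_all['param1'] == param1].drop(columns='param1')
    print(algorithmResult(c1, c2))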

0 Answers