
I have a certain task that needs to be run on the result of a DataFrame aggregation. In pseudocode, it looks something like this:

from pyspark.sql import functions as F

def perform_task(param1, param2):
    # count rows per `param2` value in each DataFrame, restricted to this `param1`
    counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
    counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
    return algorithmResult(counts1.toPandas(), counts2.toPandas())

for param_set in all_params:
    print(perform_task(*param_set))

How can this code be properly parallelized in Spark? Turning param_set into a parallelized collection and doing a .map() won't work, since I'm accessing DataFrames inside the map function - so what's the "proper" way of doing this?

I'm rather new to Spark, so any suggestions are welcome. Thanks!

reflog
  • You can [use threading](https://stackoverflow.com/a/38049303/1560062), concurrent futures (I created [a proof of concept a while ago](https://github.com/zero323/pyspark-asyncactions)) and the `FAIR` scheduler / scheduling pools (a sketch of that approach follows these comments), though the whole pattern you've shown looks a bit sketchy. `filter` -> a single `groupBy` and switching to `RDD` could be a better choice. – zero323 Sep 22 '17 at 15:21
  • It seems `param2` is never used – MaFF Sep 22 '17 at 15:24
  • @zero323 - it feels sketchy to me too :) Can you suggest some pseudocode for how to handle such a scenario? Thanks! – reflog Sep 23 '17 at 09:47
  • `algorithmResult(counts1.toPandas(), counts2.toPandas())` is probably the biggest issue, but the first thing you can do is look at [ROLLUP](https://stackoverflow.com/q/37975227/1560062) to handle all `param2` options at the same time (grouping sets prune things further). Also, filtering for each level seems redundant. Just `groupBy("param1", param2)` and ignore the parts which are not necessary (see the second sketch after these comments). – zero323 Sep 23 '17 at 11:11
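
A minimal sketch of the threading / `FAIR` scheduler approach from the first comment, assuming a SparkSession named `spark` whose context was created with `spark.scheduler.mode=FAIR`, plus the `someDF1`, `someDF2`, `algorithmResult` and `all_params` names from the question; the pool name and worker count are arbitrary illustrations:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import functions as F

def perform_task(param1, param2):
    # Each call triggers its own Spark jobs; with FAIR scheduling enabled, jobs
    # submitted from different threads share executors instead of queueing strictly.
    counts1 = someDF1.where(F.col('param1') == param1).groupBy(param2).count()
    counts2 = someDF2.where(F.col('param1') == param1).groupBy(param2).count()
    return algorithmResult(counts1.toPandas(), counts2.toPandas())

# Optionally route these jobs into a dedicated scheduler pool (the name is just an example).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "param_jobs")

# Only the driver-side job submission is parallelized here; the actual
# aggregation work still runs on the executors.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda p: perform_task(*p), all_params))

for result in results:
    print(result)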
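
And a sketch of the single-aggregation idea from the last comment, assuming every param_set groups by the same column (called 'param2' here); if the grouping column varies per task, the linked ROLLUP / grouping-sets approach generalizes this:

# One pass over each DataFrame computes counts for every (param1, param2)
# combination, replacing the per-parameter filter + groupBy jobs.
counts1_all = someDF1.groupBy('param1', 'param2').count().toPandas()
counts2_all = someDF2.groupBy('param1', 'param2').count().toPandas()

for param1, _ in all_params:
    # Slice the precomputed pandas frames instead of launching new Spark jobs.
    c1 = counts1_all[counts1_all['param1'] == param1].drop(columns='param1')
    c2 = counts2_all[counts2_all['param1'] == param1].drop(columns='param1')
    print(algorithmResult(c1, c2))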

0 Answers