I'm quite new to pyspark and not skilled python engineer trying to understand pandas UDF application for my case. I have developed ArimaX model, which for each "id" performs 4 outlook forecast (M1 till M4 ahead) while for each outlook 12 models are run (for last 12M takes previous 36M to predict 1 (or 2,3,4) months ahead and once done, RMSE is calculated over the 12 forecast made for of 4 models outlooks.
Core logic could be written as follows:
for id in id_list:
#filter df for id & do some calcs..
for outlook from range(0,4):
#prepare df for outlook range..
for per in per_range:
#perform modelling for all 12M periods; smt like a 1 per window movement
#calc rmse over 12M as prelim result & append forecast details in delta
In between I have various checks over relevancy of history, decision over exogenous variable, etc.
The problem is, that for one id the run of 4 forecast outlooks takes 1 min and I will have now 2000 ids to run, meaning ~30+ hour job run on cluster with single node.
Therefore, I would like to create pandas UDF to reduce the time and parallelize the job but I'm unsure how to handle the nested loops and that window movement in the most efficient manner.
I would appreciate any direction on what should be/is the best course of action so the solution is not an overkill for my set of skills (as well as time => business as usual). Maybe its a link to learning material I need to go thru first etc.
Thank you very much in advance!
I tried search online and here couldn't find adequate example.
Edit: After some research and with a "push" of Chris (thank you) I found out that:
- to have a function with parameters, I need to use wrapping; e.g. How do I pass multiple arguments to a Pandas UDF in PySpark? . This confused me a lot as a newbiee and differs from "common" function having parameters entered
- is good to use more functions in series based on purpose.
I was not able figure out how to elegantly pass parameters to udf (like 1 value per 1 od) so I added those as df column on input; this is not nice and column repeats values, but is reachable inside the udf for me.