I am working on a machine learning model that uses regression to predict future values for various categories of data. The data itself is quite complex, so I've included a sample below that mimics what I am trying to achieve:
df =
category date data
1 2021-06-19 94.9
1 2021-06-20 93.3
1 2021-06-21 91.6
... ... ...
2 2021-06-19 13.1
2 2021-06-20 11.9
2 2021-06-21 10.4
... ... ...
3 2021-06-19 53.9
3 2021-06-20 55.3
3 2021-06-21 59.3
... ... ...
I'm currently using a for loop, running my prediction model on each category:
categories = df.category.unique()
for category in categories:
    # run my model
    # save results
However, this is time-consuming, as I have ~4000 categories to loop over. Each category's prediction is independent of the others.
Is there a simple way to parallelise this work, rather than running the prediction for each category sequentially?
Spark is a popular result when searching online; however, it seems to have a steep learning curve (and I may lose some of the functionality available in Python/pandas), so I'm hoping there is something in the Python libraries that may be more appropriate.
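To make the question concrete, here is a rough sketch of the shape I have in mind, using the standard library's `concurrent.futures`. It is only an illustration under assumptions: `predict_category` is a hypothetical stand-in for my real model (it just extrapolates the last day-to-day change), and `predict_all` and `sample_df` are names made up for this example. I don't know whether this is the most appropriate approach, which is really what I'm asking.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def predict_category(item):
    """Hypothetical stand-in for the real model: extrapolate the last daily change."""
    category, values = item
    next_value = values[-1] + (values[-1] - values[-2])
    return category, next_value


def predict_all(df, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run predict_category once per category, spread across worker processes."""
    # Build one (category, values) pair per category, ordered by date.
    items = [
        (category, group.sort_values("date")["data"].tolist())
        for category, group in df.groupby("category")
    ]
    with executor_cls(max_workers=max_workers) as executor:
        # Each item is independent, so the work can be mapped across workers.
        return dict(executor.map(predict_category, items))


# Small sample mirroring the rows shown in the question.
sample_df = pd.DataFrame({
    "category": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "date": pd.to_datetime(["2021-06-19", "2021-06-20", "2021-06-21"] * 3),
    "data": [94.9, 93.3, 91.6, 13.1, 11.9, 10.4, 53.9, 55.3, 59.3],
})

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing the module.
    print(predict_all(sample_df))
```

With ~4000 categories, I assume the per-task overhead of sending data to worker processes could matter, so batching several categories per task might be needed.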