
I am working on a Machine Learning model, using regression to predict future values for various categories of data. The data itself is quite complex, so I've included a sample below that mimics what I am trying to achieve:

df = 
category       date   data
1        2021-06-19   94.9
1        2021-06-20   93.3
1        2021-06-21   91.6
...      ...          ...
2        2021-06-19   13.1
2        2021-06-20   11.9
2        2021-06-21   10.4
...      ...          ...
3        2021-06-19   53.9
3        2021-06-20   55.3
3        2021-06-21   59.3
...      ...          ...

I'm currently using a for loop, running my prediction model on each category:

categories = df.category.unique()

for category in categories:
  # run my model
  # save results

However, this is time-consuming, as I have ~4000 categories to loop over. Each category's prediction is independent of the others.

Is there a simple way to parallelise this work, rather than looping through each category and performing the predictions sequentially?

Spark is a popular result when searching online; however, it seems like a big learning curve (and I may lose some of the functionality available in Python/pandas), so I'm hoping there is something in the Python libraries that would be more appropriate.

MattC
  • Please post the code inside the loop, i.e. what do you do with each category. – gtomer Aug 08 '21 at 17:29
  • This sounds parallelizable, and CPU is cheap, so it boils down to what you want to learn and do -- and we can't dig that out of your head. A 96 CPU machine on demand from Google Cloud is about $5-6/hour ($1/hour if you can tolerate it being recalled), so the trick is to have a start-up script that downloads your stuff from cloud storage, gets it done, saves back to cloud storage, and terminates the machine. You can start with smaller, cheaper, machines and just change the number of CPUs on a GUI. The other cost is about $0.12/GB to download from cloud storage to places outside Google Cloud. – Paul Aug 08 '21 at 17:43

2 Answers


You can do something like this:

# The joblib module provides Parallel and delayed
from joblib import Parallel, delayed

'''
The Parallel method parallelises the work over n cores using the n_jobs
argument (-1 means use every available core). The delayed function wraps
the actual function and passes the values of a list to it in parallel.
'''

def predict_cat(category):
    # category_dataset = df[df.category == category]  # filter on that category
    # preds = model.predict(category_dataset)
    # I will write some dummy preds as I don't have actual values
    preds = [1, 2, 3]
    with open('pred_file.txt', 'a') as file:
        file.write(str(category) + " " + str(preds) + "\n")

# Use your list of unique categories here instead of range(5)
Parallel(n_jobs=-1)(delayed(predict_cat)(c) for c in range(5))

Once all the jobs finish, you can just read the predictions back from pred_file.txt.
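For instance, a minimal sketch of reading the predictions back, assuming the space-separated "category preds" lines written above:

predictions = {}
with open('pred_file.txt') as file:
    for line in file:
        # each line is "<category> <preds>"
        category, preds = line.strip().split(" ", 1)
        predictions[category] = preds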

Abhishek Prajapat

Since Spark is not a preferred approach for you, I can think of two methods; let me share both with you.

  1. You can use Joblib

Python has this great package that makes parallelism incredibly easy. Refer: https://joblib.readthedocs.io/en/latest/

The basic usage pattern is:

from joblib import Parallel, delayed

def myfun(arg):
    # do_stuff: your per-item computation goes here
    result = do_stuff(arg)
    return result

results = Parallel(n_jobs=-1, verbose=verbosity_level, backend="threading")(
    map(delayed(myfun), arg_instances))

Here, arg_instances is a list of values for which myfun is computed in parallel. The main restriction is that myfun must be a top-level function. The backend parameter can be "threading" or "multiprocessing" (recent joblib versions add a third, process-based "loky" backend, which is the default).

You can pass additional common parameters to the parallelized function. The body of myfun can also refer to initialized global variables, whose values will be available to the child workers.

Arguments and results can be pretty much anything with the threading backend, but both need to be serializable (picklable) with the multiprocessing backend.
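Applied to the question's setup, the pattern could look roughly like this; it is only a sketch, and fit_and_predict stands in for whatever you currently run inside your loop:

from joblib import Parallel, delayed

def predict_category(category):
    # select the rows belonging to this category
    category_df = df[df.category == category]
    # placeholder for your actual model fitting / prediction
    preds = fit_and_predict(category_df)
    return category, preds

categories = df.category.unique()
results = Parallel(n_jobs=-1)(
    delayed(predict_category)(c) for c in categories)
# results is a list of (category, predictions) tuples, in input order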

  2. Numba

Numba can auto-parallelize a for loop.

Refer: http://numba.pydata.org/numba-doc/latest/user/parallel.html#explicit-parallel-loops

from numba import njit, prange

# parallel=True is required for prange to actually run the loop in parallel
@njit(parallel=True)
def parallel_sum(A):
    total = 0.0
    # prange distributes the loop iterations across threads
    for i in prange(A.shape[0]):
        total += A[i]

    return total
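A quick usage check (assuming NumPy is installed; the first call includes JIT compilation time):

import numpy as np

A = np.arange(1_000_000, dtype=np.float64)
print(parallel_sum(A))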

Blog worth reading: http://blog.dominodatalab.com/simple-parallelization/

[Honorable Mention] Dask also offers similar functionality. It might be preferable if you are working with out-of-core data or trying to parallelize more complex computations. Refer: https://dask.org/
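For completeness, a rough sketch of the same per-category idea with dask.delayed; run_model is a placeholder for your actual prediction code:

import dask

@dask.delayed
def predict_category(category):
    category_df = df[df.category == category]
    # placeholder for your actual model call
    return category, run_model(category_df)

tasks = [predict_category(c) for c in df.category.unique()]
results = dask.compute(*tasks)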

aryashah2k