
All I need to do is train two regression models (using scikit-learn) on the same data at the same time, using different cores. I've tried to figure it out by myself using Process, without success.

from multiprocessing import Process
from sklearn.ensemble import GradientBoostingRegressor

gb1 = GradientBoostingRegressor(n_estimators=10)
gb2 = GradientBoostingRegressor(n_estimators=100)

def train_model(model, data, target):
    model.fit(data, target)

live_data # Pandas DataFrame object
target # Numpy array object
p1 = Process(target=train_model, args=(gb1, live_data, target)) # same data
p2 = Process(target=train_model, args=(gb2, live_data, target)) # same data
p1.start()
p2.start()

If I run the code above I get the following error while trying to start the p1 process.

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    p1.start()
  File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Python27\lib\multiprocessing\forking.py", line 274, in __init__
    to_child.close()
IOError: [Errno 22] Invalid argument

I'm running all this as a script (in IDLE) on Windows. Any suggestions on how I should proceed?

  • look at maybe using multiprocessing.Pool – Joran Beasley May 19 '13 at 04:12
  • try testing it by changing your target function and arguments to be really simple (i.e., `def myfun(*args): pass; Process(target=myfun, args=(1,))`) and then see whether it fails. After that, bring in more elements. At least that way you can isolate where the issue is happening (see the pickling sketch after these comments). – Jeff Tratner May 19 '13 at 04:51
  • hey Jeff, I've done as you said, defining myfun (pretty cool idea). First I passed the GradientBoostingRegressor pointer and had no problem. Then I tried passing the data, which is actually a pandas DataFrame, and it failed with the same message. Then I tried with the target variable (which is a numpy.ndarray) and got a totally new error message: IOError: [Errno 32] Broken pipe – Alessandro Mariani May 19 '13 at 05:10
  • Why don't you use joblib? It is included in sklearn. – Andreas Mueller May 19 '13 at 14:32
  • Also: scikit-learn doesn't accept pandas DataFrame objects. – Andreas Mueller May 19 '13 at 14:40
  • thanks for the hint Andreas, I couldn't find anything about using joblib around. I'll have a look - eventually posting a solution (a joblib sketch also follows the answer below). About pandas DataFrames, I always feed DataFrames to fit/predict, so it does accept them - I believe they're converted directly to numpy arrays/matrices – Alessandro Mariani May 19 '13 at 21:34
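
Following up on Jeff's isolation tip: on Windows, multiprocessing has to pickle every argument it sends to the child process, so one related way to pinpoint the offending argument is to try pickling each one directly. A minimal sketch (the names gb1, live_data and target are the ones from the question):

import pickle

# Try pickling each Process argument in isolation; whichever raises is
# the one multiprocessing cannot ship to the child process on Windows.
for name, obj in [('model', gb1), ('data', live_data), ('target', target)]:
    try:
        pickle.dumps(obj)
        print('%s pickles fine' % name)
    except Exception as exc:
        print('%s failed to pickle: %r' % (name, exc))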

1 Answer


Ok... after hours spent trying to get this working, I'll post my solution. First thing: if you're on Windows and you're using the interactive interpreter, you need to encapsulate all your code under the 'main' condition, with the exception of function definitions and imports. This is because on Windows a new process is spawned by re-importing the main module, so without the guard the child would re-execute the top-level code in an endless loop.

My solution below:

from sklearn.ensemble import GradientBoostingRegressor
from multiprocessing import Pool
from itertools import repeat

def train_model(params):
    # Pool.map_async passes a single argument to each worker,
    # so we pack (model, data, target) into one tuple and unpack it here
    model, data, target = params
    model.fit(data, target)
    return model

if __name__ == '__main__':
    gb1 = GradientBoostingRegressor(n_estimators=10)
    gb2 = GradientBoostingRegressor(n_estimators=100)

    live_data # Pandas DataFrame object
    target    # Numpy array object

    po = Pool(2)  # 2 is the number of worker processes we want to spawn
    gb1, gb2 = po.map_async(
        train_model,
        zip([gb1, gb2], repeat(live_data), repeat(target))
        # zip packs each (model, data, target) triple into a single tuple
    ).get()
    # map_async dispatches the tasks; get() blocks until both fits finish
    po.terminate()
    # shut down the spawned worker processes
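
As Andreas suggested in the comments, joblib (which also ships inside scikit-learn) can express the same pattern more compactly. A minimal sketch, assuming joblib is importable on its own and that live_data and target are defined as above; the same __main__ guard still applies on Windows:

from sklearn.ensemble import GradientBoostingRegressor
from joblib import Parallel, delayed

def train_model(model, data, target):
    model.fit(data, target)
    return model

if __name__ == '__main__':
    gb1 = GradientBoostingRegressor(n_estimators=10)
    gb2 = GradientBoostingRegressor(n_estimators=100)
    # live_data (DataFrame) and target (numpy array) as in the question
    gb1, gb2 = Parallel(n_jobs=2)(
        delayed(train_model)(m, live_data, target)
        for m in (gb1, gb2))

delayed(train_model) just records the call so Parallel can dispatch it to a worker; the fitted models come back in the same order as the inputs.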