
I have two dataframes - one with predictors (df_learn), one with targets (target_learn). I want to create a list of scikit-learn models (ml_list), one per target. So far, I have written this:

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor as GBM

df_learn = pd.DataFrame({'x1':[0,0,0,1,1,1], 'x2':[1,0,1,0,1,0], 'x3':[1,1,0,0,0,0]})
target_learn = pd.DataFrame({'y1':[1,0,0,2,2,0], 'y2':[1,1,1,0,1,0]})
target_colnames = ['y1', 'y2']
ml_list = [GBM(n_estimators=5, max_depth=2, min_samples_split=2,
               learning_rate=0.1, loss='squared_error')] * 2
for i in [0, 1]:
    ml_list[i] = ml_list[i].fit(df_learn, target_learn[target_colnames[i]])

To check this, I created a list of predictions.

pred_list = []

for i in [0, 1]:
    pred_list.append(ml_list[i].predict(df_learn))

pd.DataFrame(dict(zip(target_colnames, pred_list)))

The result surprised me, as I got the exact same predictions for both targets.

y1      y2
0.80317 0.80317
0.80317 0.80317
0.80317 0.80317
0.39366 0.39366
0.80317 0.80317
0.39366 0.39366

When I ran each model separately (without using a list), I had two distinct predictions.

m1 = GBM(n_estimators=5, max_depth=2, min_samples_split=2,
         learning_rate=0.1, loss='squared_error')
m2 = GBM(n_estimators=5, max_depth=2, min_samples_split=2,
         learning_rate=0.1, loss='squared_error')
m1 = m1.fit(df_learn, target_learn['y1'])
m2 = m2.fit(df_learn, target_learn['y2'])
p1 = m1.predict(df_learn)
p2 = m2.predict(df_learn)
pd.DataFrame(dict(zip(target_colnames, [p1, p2])))

Which gave the following results.

y1       y2
0.710278 0.80317
0.608147 0.80317
0.567309 0.80317
0.901585 0.39366
1.311095 0.80317
0.901585 0.39366

Apparently, each iteration of the for loop overwrites the result of the previous member in the list. I assume this is related to some copy/deep-copy issue. How should I fix it?

AshOfFire

1 Answer


When you do:

ml_list = [GBM(n_estimators=5, max_depth=2, min_samples_split=2,
               learning_rate=0.1, loss='squared_error')] * 2

Python is not copying the object at all: multiplying a one-element list by 2 repeats the same reference, so both list elements point to one and the same underlying `GBM` object.
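The aliasing is easy to demonstrate without scikit-learn at all; here is a minimal sketch using a stand-in class:

```python
class Model:
    """Stand-in for any mutable object, e.g. a scikit-learn estimator."""
    pass

# List multiplication repeats the reference, not the object:
aliased = [Model()] * 2
print(aliased[0] is aliased[1])    # True: one object in both slots

# A list comprehension evaluates the constructor once per element:
distinct = [Model() for _ in range(2)]
print(distinct[0] is distinct[1])  # False: two separate objects
```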

So when you do this:

for i in [0, 1]:
    ml_list[i] = ml_list[i].fit(df_learn, target_learn[target_colnames[i]])

the single underlying GBM object is refit on each iteration. It only remembers the last call to fit, i.e. the one for i = 1 (target y2).

You can fix that by putting two distinct objects in the list instead of using * 2:

ml_list = [GBM(n_estimators=5, max_depth=2, min_samples_split=2,
               learning_rate=0.1, loss='squared_error'),
           GBM(n_estimators=5, max_depth=2, min_samples_split=2,
               learning_rate=0.1, loss='squared_error')]

Then each element is fitted independently, and you will get the two distinct predictions you saw when running the models separately.
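If writing the constructor out once per target does not scale, a list comprehension builds one fresh estimator per element; here is a sketch assuming the same import and hyperparameters as the question (loss left at its default, squared error):

```python
from sklearn.ensemble import GradientBoostingRegressor as GBM

target_colnames = ['y1', 'y2']  # works the same for 12 or 50 targets

# GBM(...) is re-evaluated on every iteration of the comprehension,
# so each list element is a distinct, independently fittable estimator.
ml_list = [GBM(n_estimators=5, max_depth=2, min_samples_split=2,
               learning_rate=0.1)
           for _ in target_colnames]

print(ml_list[0] is ml_list[1])  # False
```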

See the documentation of Python's `copy` module for more information about shallow and deep copies.

To fill a list with several estimators of the same type, scikit-learn provides a `clone` function, which returns a new unfitted estimator with the same parameters as the one provided. You can do:

from sklearn.base import clone

est = GBM(n_estimators=5, max_depth=2, min_samples_split=2,
          learning_rate=0.1, loss='squared_error')
ml_list = []
for i in range(50):
    ml_list.append(clone(est))
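As a quick sanity check (assuming scikit-learn is installed), the clones are separate objects that all share the template's hyperparameters:

```python
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor as GBM

est = GBM(n_estimators=5, max_depth=2, min_samples_split=2, learning_rate=0.1)
ml_list = [clone(est) for _ in range(12)]  # one unfitted copy per target

print(len({id(m) for m in ml_list}))                # 12 distinct objects
print(ml_list[0].get_params() == est.get_params())  # True: same settings
```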
Vivek Kumar
  • That was my thought at some point. So basically it's impossible to create the list without typing n times the number of models you need ? (For the purpose of the example, I left 2 targets but actually, I am dealing with 12 targets, and may have to do it for potentially >50 ones...) – AshOfFire Apr 06 '18 at 09:10
  • @AshOfFire I have edited the answer to include the solution to your use-case – Vivek Kumar Apr 06 '18 at 09:15