2

I am using sklearn modules to find the best-fitting models and model parameters. However, I get an unexpected IndexError, shown below:

> IndexError                                Traceback (most recent call last)
> <ipython-input-38-ea3f99e30226> in <module>
>      22             s = mean_squared_error(y[ts], best_m.predict(X[ts]))
>      23             cv[i].append(s)
> ---> 24     print(np.mean(cv, 1))
> IndexError: tuple index out of range

What I want to do is find the best-fitting regressor and its parameters, but I get the above error instead. I looked on SO and tried this solution, but the same error still comes up. Any idea how to fix this bug? Can anyone point out why this error is happening?

my code:

import warnings

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor

from sklearn.datasets import make_regression

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

X, y = make_regression(n_samples=10000, n_features=20)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    cv = [[] for _ in range(len(models))]
    fold = KFold(5,shuffle=False)
    for tr, ts in fold.split(X):
        for i, (model, param) in enumerate(zip(models, params)):
            best_m = GridSearchCV(model, param)
            best_m.fit(X[tr], y[tr])
            s = mean_squared_error(y[ts], best_m.predict(X[ts]))
            cv[i].append(s)
    print(np.mean(cv, 1))

desired output:

If there is a way to fix the above error, I expect to end up with the best-fitted models and their parameters, which I can then use for estimation. Any idea how to improve the above attempt? Thanks.

Jerry07
  • @desertnaut How do you think I can optimize this code? Any better ideas? – Jerry07 Jul 16 '19 at 16:47
  • That's a very general question, but doing a grid search in *each* one of 5 folds sounds like overkill. I kindly suggest you open another question asking for advice in this (be sure to make your code fully reproducible, including all relevant imports). – desertnaut Jul 16 '19 at 16:56
  • The error can be reproduced with `np.mean([],1)`, which supports the idea that `cv` is `[]` or contains `[]` lists. – hpaulj Jul 16 '19 at 17:59

2 Answers

3

When you define

cv = [[] for _ in range(len(models))]

it contains one empty list per model, i.e. six empty lists. In the loop, however, you iterate over enumerate(zip(models, params)), which yields only two elements, because your params list has only two elements (list(zip(x, y)) has length equal to min(len(x), len(y))).

Hence, all but the first two lists in cv stay empty, which makes cv ragged: when you calculate the mean with np.mean(cv, 1), NumPy cannot turn cv into a 2-D array, the requested axis 1 does not exist, and you get an IndexError.
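
A stripped-down sketch of the zip behaviour (plain strings stand in for the actual estimators):

models = ["SVR", "RandomForest", "Linear", "Ridge", "Lasso", "XGB"]
params = [{"C": [0.01, 1]}, {"n_estimators": [10, 20]}]

# zip stops at the shorter input, so only the first two models are ever paired
print(list(zip(models, params)))
# [('SVR', {'C': [0.01, 1]}), ('RandomForest', {'n_estimators': [10, 20]})]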

Solution: If you don't need to use GridSearchCV on the remaining models, you can simply extend the params list with empty dictionaries:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
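
If you prefer not to count the placeholders by hand, a small safeguard (just a sketch, not required by the fix) is to pad params programmatically and check that the two lists stay in sync:

# pad params with empty dicts so every model gets a (possibly empty) grid
params = params + [{}] * (len(models) - len(params))
assert len(params) == len(models)
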
Psi
  • I don't think this is the answer to this question. Please read the `SO` community rules. – Jerry07 Jul 16 '19 at 16:14
  • @Dan Since you haven't posted a MWE I can't verify with certainty that this is the solution, but it works with your code after importing the appropriate modules, and it matches the output you gave in the comments for `cv` (see the last edit for the specific change you would have to make to `params`). – Psi Jul 16 '19 at 16:17
  • This is the correct answer indeed (upvoted) - can't understand the downvotes; I proceed to explain in more detail... – desertnaut Jul 16 '19 at 16:32
2

The root cause of your issue is that, while you ask for the evaluation of 6 models in GridSearchCV, you provide parameters only for the first 2:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

The result of enumerate(zip(models, params)) in this setting, i.e.:

for i, (model, param) in enumerate(zip(models, params)):
    print((model, param))

is

(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})

i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:

print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]

which causes the downstream error when you try to compute np.mean(cv, 1).
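
For reference, this step can be reproduced in isolation; the exact exception depends on the NumPy version (older releases raise IndexError: tuple index out of range, as in the question, while newer ones raise numpy.AxisError, an IndexError subclass), but the cause is the same ragged shape:

import numpy as np

# a ragged cv (some inner lists empty) cannot form a 2-D array: NumPy ends up
# with a 1-D object array of length 6, so asking for axis=1 fails
cv = [[5950.6, 5987.3], [3625.3, 3301.4], [], [], [], []]
arr = np.array(cv, dtype=object)  # shape (6,), not (6, 5)
np.mean(arr, 1)                   # IndexError / AxisError: axis 1 is out of bounds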

The solution, as already correctly pointed out by Psi in their answer, is to use empty dictionaries for the models on which you don't actually perform any CV search; omitting the XGBRegressor (which I have not installed), here are the results:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]

cv = [[] for _ in range(len(models))]
fold = KFold(5,shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)

where print(cv) gives:

[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]

and print(np.mean(cv, 1)) works OK, giving:

[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
 1.01907048e+01]

So, in your case, you should indeed change params to:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]

as already suggested by Psi.
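
Once the lengths match, each fitted best_m also carries the selected configuration, so "picking up the best-fitted models with their parameters", as asked, only needs the standard GridSearchCV attributes. A minimal sketch, assumed to sit inside the inner loop right after best_m.fit:

# inside the inner loop, after best_m.fit(X[tr], y[tr]):
print(type(model).__name__, best_m.best_params_)  # winning hyper-parameters for this fold
fitted = best_m.best_estimator_                   # refitted on X[tr] (refit=True by default)
preds = fitted.predict(X[ts])                     # ready to use for estimation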

desertnaut