
I have a loop that increases the degree of the polynomial features each iteration. Currently, the loop overwrites the model variable each time through, so at the end of the loop I only have access to the last model object I created:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.preprocessing import PolynomialFeatures

    logitCV = LogisticRegressionCV(class_weight='balanced', random_state=42, cv=5, scoring='accuracy')
    comparisons = pd.DataFrame(columns=['model', 'data', 'accuracy'])
    dims = np.arange(1, 4)

    for i in dims:
        poly = PolynomialFeatures(degree=i, include_bias=False)
        X_poly_train = poly.fit_transform(X_train)
        # transform (not fit_transform) the test set with the transformer
        # already fitted on the training data
        X_poly_test = poly.transform(X_test)

        model = logitCV.fit(X_poly_train, y_train)
        train_score = model.score(X_poly_train, y_train)
        test_score = model.score(X_poly_test, y_test)

        model_name = 'dims_{}'.format(i)
        comparisons.loc[len(comparisons)] = [model_name, 'train', train_score]
        comparisons.loc[len(comparisons)] = [model_name, 'test', test_score]

How can I change the name of the model object each iteration?

Ideally, this would return a model object for each set of features I used. In the case above, there are three models (y = X; y = X+X^2; y = X+X^2+X^3), so there should be three model objects accessible at the end of the loop (model_1; model_2; model_3).

Thanks for the help!

    Not related to your question, but if you're trying to perform a search over the degree of the polynomial feature, you might look into just using the `GridSearchCV` class and searching over the `degree` parameter. – TayTay Jul 03 '18 at 18:06
  • That's a good point @Tgsmith61591. I'll have to look into how that works. Thanks for the suggestion! – NLR Jul 03 '18 at 18:36

2 Answers


I would suggest creating a list of models and appending each model to the list inside the loop. One caveat: `fit` returns the estimator itself rather than a new object, so fit a fresh clone of the estimator on each iteration; otherwise every entry in the list ends up pointing at the same, last-fitted object.

Example:

    from sklearn.base import clone

    models = []
    # ...
    for i in dims:
        # ...
        # fit a fresh clone each iteration; fit() returns the same instance,
        # so appending logitCV.fit(...) directly would leave the list holding
        # one object that was simply refit three times
        model = clone(logitCV).fit(X_poly_train, y_train)
        models.append(model)

Then you will have access to each of the models in that list after the loop is over.
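For example, assuming the `dims = np.arange(1, 4)` loop from the question, you can index the list directly or pair each model with the degree it was fitted on:

    model_1 = models[0]     # fitted on degree-1 features
    model_3 = models[-1]    # fitted on degree-3 features

    # or walk (degree, model) pairs together
    for degree, model in zip(dims, models):
        print('dims_{}'.format(degree), model)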

  • thanks for the suggestion, but how do I change the model object name based on the iterator (i)? – NLR Jul 03 '18 at 20:16
  • I'm not sure I understand why you would need to. You can access the individual model from the list based on the iterator with `models[i]`. If you know ahead of time how many you will have, you could just assign them directly after the loop with `model_1 = models[0]`, etc. – bzier Jul 03 '18 at 21:57
  • You may also want to check out the answer [here](https://stackoverflow.com/a/1373185/9526448) about using dictionaries. – bzier Jul 06 '18 at 17:52
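
A minimal sketch of that dictionary approach, keyed by the `dims_{i}` names the question asks for (this reuses `logitCV`, `dims`, and the feature-building code from the question):

    from sklearn.base import clone

    models = {}
    for i in dims:
        poly = PolynomialFeatures(degree=i, include_bias=False)
        X_poly_train = poly.fit_transform(X_train)
        # fit a fresh clone so each entry is an independent model
        models['dims_{}'.format(i)] = clone(logitCV).fit(X_poly_train, y_train)

    models['dims_2']   # the model fitted on degree-2 features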

What you are trying to accomplish is known as grid search. Sklearn has a built-in class, `GridSearchCV`, designed for this exact purpose. While you won't get back a list of models, you will be able to view the results for each parameter setting and access the best-performing model. To use this with `PolynomialFeatures`, I would also encourage the use of `Pipeline`. For example:

    import numpy as np

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures


    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
    pipe = Pipeline(steps=[('poly', PolynomialFeatures()), ('lr', LogisticRegression())])
    params = {'poly__degree': np.arange(1, 5)}  # degrees 1 through 4, matching the output below
    gs = GridSearchCV(pipe, params, return_train_score=True)
    gs.fit(X_train, y_train)
    GridSearchCV(cv=None, error_score='raise',
           estimator=Pipeline(memory=None,
         steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]),
           fit_params=None, iid=True, n_jobs=1,
           param_grid={'poly__degree': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
           refit=True, return_train_score='warn', scoring=None, verbose=0)
    gs.cv_results_
    {'mean_fit_time': array([0.00133387, 0.00099603, 0.00133324, 0.00199993]),
     'mean_score_time': array([0.00066773, 0.00099413, 0.00100025, 0.00100017]),
     'mean_test_score': array([  0.90775274,   0.91685398,   0.80601582, -40.54378954]),
     'mean_train_score': array([0.92144066, 0.95029226, 0.95571164, 0.98727079]),
     'param_poly__degree': masked_array(data=[1, 2, 3, 4],
                  mask=[False, False, False, False],
            fill_value='?',
                 dtype=object),
     'params': [{'poly__degree': 1},
      {'poly__degree': 2},
      {'poly__degree': 3},
      {'poly__degree': 4}],
     'rank_test_score': array([2, 1, 3, 4]),
     'split0_test_score': array([  0.88284837,   0.88510265,   0.73325603, -10.01478051]),
     'split0_train_score': array([0.93086987, 0.96444943, 0.98005722, 0.99820903]),
     'split1_test_score': array([  0.92250837,   0.9227331 ,   0.88028476, -12.49501116]),
     'split1_train_score': array([0.91665687, 0.94718893, 0.96290854, 0.99867128]),
     'split2_test_score': array([  0.91857458,   0.94358434,   0.80647314, -99.9466853 ]),
     'split2_train_score': array([0.91679523, 0.93923843, 0.92416916, 0.96493206]),
     'std_fit_time': array([4.70942072e-04, 5.50718821e-06, 4.71538951e-04, 1.12391596e-07]),
     'std_score_time': array([4.72159663e-04, 7.86741172e-06, 1.12391596e-07, 1.94667955e-07]),
     'std_test_score': array([1.79179093e-02, 2.42798791e-02, 6.01535692e-02, 4.17355600e+01]),
     'std_train_score': array([0.0066677 , 0.01052367, 0.02337684, 0.01579699])}
    gs.best_estimator_
    Pipeline(memory=None,
         steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False))])
    gs.best_estimator_.score(X_test, y_test)
    0.9736842105263158
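
As a side note, the per-degree scores shown above live in `gs.cv_results_`; a quick way to view them as a table, mirroring the `comparisons` DataFrame from the question (a minimal sketch, assuming `pandas` is available):

    import pandas as pd

    # cv_results_ is a dict of parallel arrays; a DataFrame makes it readable
    results = pd.DataFrame(gs.cv_results_)
    print(results[['param_poly__degree', 'mean_train_score', 'mean_test_score']])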
  • thanks for this. I have a question about `poly__degree`: is this the standard format for accessing the specific parameters of steps in the pipeline? For example, what if I also want to find the optimal regularization coefficient `C`? Would I simply add `params = {'poly__degree' : np.arange(1,4), 'logreg__C' : np.logspace(-5, 8, 15)}`? In other words, do I simply use the double-underscore notation `step__parameter` to create the GridSearch parameters? – NLR Jul 03 '18 at 21:29
  • Yes, the access pattern is `name__parameter`, where `name` is the name you assigned to the pipeline step. For my example it would be `lr__C`. – Grr Jul 03 '18 at 21:35
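
To illustrate that convention, a minimal sketch extending the grid above to search two steps at once (the `C` range is just an example):

    # keys follow the step-name__parameter-name convention
    params = {
        'poly__degree': np.arange(1, 5),
        'lr__C': np.logspace(-3, 3, 7),   # illustrative range
    }
    gs = GridSearchCV(pipe, params, return_train_score=True)
    gs.fit(X_train, y_train)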