14

I am new to scikit-learn, and it did what I was hoping for. The one remaining issue, maddeningly, is that I cannot find how to print (or, even better, write to a small text file) all the coefficients it estimated and all the features it selected. How can I do this?

I am using SGDClassifier, but I think the answer is the same for all base objects that can be fit, with or without cross-validation. Full script below.

import scipy as sp
import numpy as np
import pandas as pd
import multiprocessing as mp
from sklearn import grid_search
from sklearn import cross_validation
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier


def main():
    print("Started.")
    # n = 10**6
    # notreatadapter = iopro.text_adapter('S:/data/controls/notreat.csv', parser='csv')
    # X = notreatadapter[1:][0:n]
    # y = notreatadapter[0][0:n]
    notreatdata = pd.read_stata('S:/data/controls/notreat.dta')
    notreatdata = notreatdata.iloc[:10000,:]
    X = notreatdata.iloc[:,1:]
    y = notreatdata.iloc[:,0]
    n = y.shape[0]

    print("Data loaded.")
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)

    print("Data split.")
    scaler = StandardScaler()
    scaler.fit(X_train)  # Don't cheat - fit only on training data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply same transformation to test data

    print("Data scaled.")
    # build a model
    # n_iter must be an integer; np.ceil returns a float
    model = SGDClassifier(penalty='elasticnet', n_iter=int(np.ceil(10**6 / n)), shuffle=True)
    #model.fit(X,y)

    print("CV starts.")
    # run grid search
    param_grid = [{'alpha' : 10.0**-np.arange(1,7),'l1_ratio':[.05, .15, .5, .7, .9, .95, .99, 1]}]
    gs = grid_search.GridSearchCV(model,param_grid,n_jobs=8,verbose=1)
    gs.fit(X_train, y_train)

    print("Scores for alphas:")
    print(gs.grid_scores_)
    print("Best estimator:")
    print(gs.best_estimator_)
    print("Best score:")
    print(gs.best_score_)
    print("Best parameters:")
    print(gs.best_params_)


if __name__=='__main__':
    mp.freeze_support()
    main()
László

3 Answers

21

The SGDClassifier instance fitted with the best hyperparameters is stored in gs.best_estimator_. Its coef_ and intercept_ attributes (gs.best_estimator_.coef_ and gs.best_estimator_.intercept_) hold the fitted parameters of that best model.
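
As a minimal sketch of the above, here is how the coefficients could be read from the best estimator and written to a small text file. The synthetic data, the small parameter grid, and the file name `coefficients.txt` are assumptions made for illustration. It also uses `sklearn.model_selection` and `max_iter`, which replaced the `grid_search` module and the `n_iter` argument in newer scikit-learn releases.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

model = SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3,
                      random_state=0)
param_grid = {"alpha": [1e-3, 1e-2], "l1_ratio": [0.15, 0.9]}
gs = GridSearchCV(model, param_grid, cv=3)
gs.fit(X, y)

best = gs.best_estimator_            # SGDClassifier refit with the best hyperparameters
coefs = best.coef_                   # shape (1, n_features) for a binary problem
intercept = best.intercept_          # shape (1,)
selected = np.flatnonzero(coefs[0])  # indices of features with a non-zero weight

print(coefs)
print(intercept)
print(selected)

# Write one coefficient per line; the intercept goes into a commented header line.
# The file name is hypothetical.
out_path = os.path.join(tempfile.gettempdir(), "coefficients.txt")
np.savetxt(out_path, coefs.ravel(), header="intercept=%r" % float(intercept[0]))
```

With an L1 or elastic-net penalty, the non-zero entries of `coef_` can be read as the selected features, which is what `selected` collects here.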

ogrisel
  • 2
    Thanks, I did not see `coef_` and `intercept_` listed among properties, so I missed that `gs` will have those too. (You did mean `gs.coef_` and not `gs.best_estimator_.coef_`, right? Though I should be able to test that.) – László Jun 24 '14 at 14:13
  • 4
    No I mean `gs.best_estimator_.coef_`. `gs.best_estimator_` is the best estimator found by the grid search (an instance of `SGDClassifier`). – ogrisel Jun 25 '14 at 13:56
  • Is it possible to get all the coefficients for different tuning parameters? – doraemon Mar 16 '18 at 10:18
  • AFAIK the current API does allow to do this, unfortunately. You will probably have to implement your own grid search. – ogrisel Mar 18 '18 at 13:15
  • 2
    Did you mean 'API does not allow' rather than 'does allow'? – leonkato Mar 28 '18 at 00:59
  • Indeed, it does not allow, sorry for the confusion. – ogrisel Mar 31 '18 at 12:33
  • There is no `coef_` attribute, neither for gs nor for `best_estimator_` in scikit-learn-0.24.2. I still don't know how to extract coefficients from gs. – Ghislain Viguier Jun 15 '21 at 12:40
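
The "implement your own grid search" suggestion from the comments could be sketched roughly as follows. This is an illustration under assumptions rather than a drop-in replacement for GridSearchCV: the data is synthetic, the grid is tiny, and each candidate is fitted on the full data without cross-validated scoring. `ParameterGrid` and `clone` are existing scikit-learn utilities; the variable names are made up for this example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

base = SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3,
                     random_state=0)
grid = ParameterGrid({"alpha": [1e-3, 1e-2], "l1_ratio": [0.15, 0.9]})

# Fit one fresh copy of the estimator per parameter combination
# and keep its coefficients alongside the parameters that produced them.
all_coefs = []
for params in grid:
    est = clone(base).set_params(**params).fit(X, y)
    all_coefs.append((params, est.coef_.copy()))

for params, coef in all_coefs:
    print(params, coef.ravel())
```

If a score per combination is also needed, something like `cross_val_score` could be called inside the loop on a separate clone.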
3
  • From a fitted estimator, you can get the coefficients with the coef_ attribute.
  • From a fitted pipeline, you can get the model step with the named_steps attribute, then get the coefficients with coef_.
  • From a fitted grid search, you can get the best model with best_estimator_. If that best model is a pipeline, use named_steps to get the model step, then get the coef_.

Example:

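from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Placeholder data so the example runs; substitute your own X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearSVC())
])

# from pipe:
pipe.fit(X, y)
coefs = pipe.named_steps.model.coef_

# from gridsearch:
gs_svc_model = GridSearchCV(estimator=pipe,
                            param_grid={
                                'model__C': [.01, .1, 10, 100, 1000],
                            },
                            cv=5,
                            n_jobs=-1)
gs_svc_model.fit(X, y)
# best_estimator_ is the refitted pipeline; named_steps reaches the LinearSVC inside it
coefs = gs_svc_model.best_estimator_.named_steps.model.coef_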
1

I think you might be looking for the estimated parameters of the "best" model rather than the hyper-parameters determined through grid search. You can plug the best hyper-parameters from the grid search ('alpha' and 'l1_ratio' in your case) back into the model ('SGDClassifier' in your case) and train it again. You can then read the parameters from the fitted model object.

The code could be something like this:

model2 = SGDClassifier(penalty='elasticnet', n_iter=int(np.ceil(10**6 / n)), shuffle=True,
                       alpha=gs.best_params_['alpha'], l1_ratio=gs.best_params_['l1_ratio'])
model2.fit(X_train, y_train)  # coef_ only exists after the model has been fitted
print(model2.coef_)
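
One point worth checking before retraining by hand: GridSearchCV has `refit=True` by default, so `best_estimator_` has already been refitted on the full data passed to `fit`. The sketch below compares the two approaches, assuming synthetic data and a fixed `random_state` so that both fits are reproducible. It uses `max_iter` in place of `n_iter`, which newer scikit-learn releases no longer accept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

base = SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3,
                     random_state=0)
param_grid = {"alpha": [1e-3, 1e-2], "l1_ratio": [0.15, 0.9]}

gs = GridSearchCV(base, param_grid, cv=3)  # refit=True is the default
gs.fit(X, y)

# Retrain by hand with the best hyperparameters, as described above.
model2 = SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3,
                       random_state=0, **gs.best_params_)
model2.fit(X, y)

# Compare the hand-trained coefficients with those of the refitted best estimator.
same = np.allclose(model2.coef_, gs.best_estimator_.coef_)
print(same)
```

Without a fixed `random_state`, SGD shuffling would make the two sets of coefficients differ slightly even though the hyperparameters match.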
Ted