
I am trying to understand how the best coefficients are calculated by LogisticRegressionCV when the "refit" parameter is True. If I understand the docs correctly, the best regularization parameter "C" is determined first, i.e., the value of C that has the highest average score over all folds. The best coefficients are then simply the coefficients that were calculated on the fold with the highest score for that C. I assume that if several folds achieve the maximum score, their coefficients are averaged to give the best coefficients (I didn't see anything in the docs on how this case is handled).

To test my understanding, I determined the best coefficients in two different ways:

  1. directly from the coef_ attribute of the fitted model, and
  2. from the coefs_paths_ attribute, which contains the path of the coefficients obtained during cross-validation, across each fold and then across each C.

The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here. Thanks!

An example to demonstrate the issue:

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]

# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1', 
                           refit=True, scoring='roc_auc', 
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)

########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")

########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the 
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]

paths = clf.coefs_paths_[1]  # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
mella
  • It would be helpful to include example input data, and outputs, especially to illustrate how much the regression coefficients might vary between different folds. – rwp Mar 29 '18 at 18:10
  • @rwp What kind of example input are you thinking of? The example I posted uses scikit-learn's breast cancer dataset as input. If you run the example you can see the output (plots of coefs1 and coefs2), and that they are not equal (which can also be verified using numpy.array_equal(coefs1, coefs2)). My question is basically how you could calculate/reproduce the best coefficients (given by clf.coef_) from the coefs_paths_ attribute, which contains the coefficients for all values of C on each fold. – mella Mar 29 '18 at 20:03

1 Answer


I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.

The key point is the refit parameter of LogisticRegressionCV. According to the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html):

refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
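
As a rough check of this reading, here is a minimal sketch that reuses the data and settings from the question. It assumes that, with refit=True, coef_ comes from a final fit on the whole training set using the selected C, so it should be close to what a plain LogisticRegression gives with that same C. The names cv_model, direct_model and max_abs_diff are only for illustration, and how close the two results are may depend on the solver and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same candidate values of C and preprocessing as in the question
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Cross-validated model with refit=True, configured as in the question
cv_model = LogisticRegressionCV(Cs=C_values, cv=10, penalty='l1',
                                refit=True, scoring='roc_auc',
                                solver='liblinear', random_state=0,
                                fit_intercept=False)
cv_model.fit(X_train_scaled, y_train)

# C selected by cross-validation (C_ holds one value for a binary problem)
best_C = float(np.ravel(cv_model.C_)[0])

# Assumption being checked: a plain fit on ALL training data with best_C
# should give coefficients close to cv_model.coef_
direct_model = LogisticRegression(C=best_C, penalty='l1',
                                  solver='liblinear', random_state=0,
                                  fit_intercept=False)
direct_model.fit(X_train_scaled, y_train)

max_abs_diff = np.max(np.abs(cv_model.coef_ - direct_model.coef_))
print("best C:", best_C)
print("max |coef_ difference|:", max_abs_diff)
```

If the printed difference is small, that would explain why no single row of coefs_paths_ reproduces coef_: each row was fitted on only part of the training data, while the refit uses all of it.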

Best.

Chanh Duc