You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like `LogisticRegressionCV`, `GridSearchCV`, or `RandomizedSearchCV`, tune hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning, on the assumption that they will contribute to optimal learning. More information is present here:
> Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include `C`, `kernel` and `gamma` for Support Vector Classifier, `alpha` for Lasso, etc.
In the case of `LogisticRegression`, `C` is a hyper-parameter which describes the inverse of the regularization strength: the higher the `C`, the less regularization is applied to the training. It is not that `C` will be changed during training; it stays fixed.
Now coming to `coef_`. `coef_` contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of `C` (and the other hyper-parameters set in the constructor), these can vary during training.
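To make the distinction concrete, here is a minimal sketch on made-up toy data (the data and values are purely illustrative): `C` stays exactly what you set it to, while `coef_` only exists after `fit()` and changes with `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression(C=0.01)  # hyper-parameter, set before learning
clf.fit(X, y)
print(clf.C)      # still 0.01 -- fit() never changes it
print(clf.coef_)  # learned parameters (the feature weights)

# A different C leads to different learned coefficients
clf2 = LogisticRegression(C=100.0).fit(X, y)
print(clf2.coef_)  # typically larger magnitudes, since less regularization is applied
```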
Now, there is another topic on how to get optimal initial values of `coef_` so that training is faster and better. That is optimization. Some approaches start with random weights between 0 and 1, others start with 0, etc. But for the scope of your question that is not relevant; `LogisticRegressionCV` is not used for that.
This is what `LogisticRegressionCV` does (a sketch of the procedure follows the list):
- Get the different values of `C` from the constructor (in your example you passed 1.0).
- For each value of `C`, cross-validate the supplied data: `LogisticRegression` is `fit()` on the training data of the current fold and scored on the test data of that fold. The test scores from all folds are averaged, and that average becomes the score of the current `C`. This is done for all the `C` values you provided, and the `C` with the highest average score is chosen.
- The chosen `C` is set as the final `C`, and `LogisticRegression` is trained again (by calling `fit()`) on the whole data (`Xdata`, `ylabels` here).
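Here is a rough sketch of that procedure in plain scikit-learn. This is a simplified illustration, not the actual `LogisticRegressionCV` implementation (which, among other things, can reuse solver state across folds); `Xdata` and `ylabels` are the variables from your example, and the candidate `C` values are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

Cs = [0.01, 0.1, 1.0, 10.0]  # candidate values for the hyper-parameter

# Steps 1-2: score each C by averaging cross-validation test scores
avg_scores = []
for C in Cs:
    scores = cross_val_score(LogisticRegression(C=C), Xdata, ylabels, cv=5)
    avg_scores.append(scores.mean())

# Step 3: pick the best C and refit on the whole data
best_C = Cs[int(np.argmax(avg_scores))]
final_model = LogisticRegression(C=best_C).fit(Xdata, ylabels)
```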
That is what all the hyper-parameter tuners do, be it `GridSearchCV`, `LogisticRegressionCV`, `LassoCV`, etc.
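For instance, the two snippets below tune `C` in roughly the same way (the candidate values are illustrative; both estimators refit on the whole data by default via `refit=True`):

```python
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

Cs = [0.01, 0.1, 1.0, 10.0]

# Dedicated CV estimator for logistic regression
lr_cv = LogisticRegressionCV(Cs=Cs, cv=5).fit(Xdata, ylabels)

# Generic hyper-parameter tuner doing the same job
grid = GridSearchCV(LogisticRegression(), {'C': Cs}, cv=5).fit(Xdata, ylabels)

print(lr_cv.C_, grid.best_params_)
```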
The initializing and updating of `coef_` (the feature weights) is done inside the `fit()` function of the algorithm, which is out of scope for the hyper-parameter tuning. That optimization part depends on the internal optimization algorithm of the estimator, for example the `solver` param in the case of `LogisticRegression`.
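For example, switching `solver` changes how `coef_` is optimized internally, not which hyper-parameters are tuned. A small illustration (again using `Xdata` and `ylabels` from your example; the resulting coefficients will usually agree only up to small numerical differences):

```python
from sklearn.linear_model import LogisticRegression

# Same hyper-parameters, different internal optimizers for coef_
for solver in ['lbfgs', 'liblinear', 'saga']:
    clf = LogisticRegression(C=1.0, solver=solver, max_iter=1000).fit(Xdata, ylabels)
    print(solver, clf.coef_)
```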
Hope this makes things clear. Feel free to ask if you still have any doubts.