Interesting question.
Your question really should be broken up into multiple other questions, such as:
- How can I tell if my data is collinear?
- How to deal with collinear data in a machine learning problem?
- How can I convert logistic regression to elasticnet for classification?
I am going to focus on the third bullet above.
Additionally, there is no sample data, nor a minimal, complete, reproducible code example to work from, so I am going to make some assumptions below.
How can I use logistic regression for classification?
What's the difference between logistic regression and elasticnet?
First, let's understand what is different about logistic regression vs elasticnet. This TowardsDataScience article is fairly well written and goes into the details a little bit; you should review it if you are unfamiliar. In short: logistic regression does not penalize the model for its weight choices, while elasticnet adds both an absolute-value (L1) and a squared (L2) penalty, blended by an l1_ratio coefficient.
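To make that blend concrete, here is a minimal sketch of what an elasticnet-style penalty computes for a weight vector. Note that `elasticnet_penalty` is a hypothetical helper written for illustration, not a function from sklearn:

```python
import numpy as np

def elasticnet_penalty(w, C=1.0, l1_ratio=0.5):
    # Hypothetical illustration: elasticnet blends an L1 (absolute value)
    # and an L2 (squared) penalty, weighted by l1_ratio.
    # l1_ratio=1.0 recovers pure L1 (lasso); l1_ratio=0.0 recovers pure L2 (ridge).
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    # Smaller C means stronger regularization, as in sklearn's convention.
    return (1.0 / C) * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
```

Plain logistic regression (in the unpenalized sense described above) would simply leave this term out of the loss.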
What does that difference look like in code?
You can review the source code for LogisticRegression here; in short, lines 794-796 show the alpha and beta values changing when the penalty type is elasticnet.
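Roughly speaking, that section of the source translates the single C parameter into an L2 strength (alpha) and an L1 strength (beta) for the SAGA solver. The sketch below is my approximation of that logic for illustration, not a verbatim copy of sklearn's code:

```python
def saga_penalty_terms(C, penalty, l1_ratio=None):
    # Approximation of how sklearn splits regularization strength for SAGA:
    # alpha is the L2 strength, beta is the L1 strength.
    if penalty == "l1":
        alpha, beta = 0.0, 1.0 / C
    elif penalty == "l2":
        alpha, beta = 1.0 / C, 0.0
    else:
        # elasticnet: the total strength 1/C is split between
        # L2 and L1 according to l1_ratio.
        alpha = (1.0 / C) * (1.0 - l1_ratio)
        beta = (1.0 / C) * l1_ratio
    return alpha, beta
```

So with l1_ratio=0.5, half the regularization budget goes to each penalty, and the pure-l1 and pure-l2 cases fall out as the endpoints.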
What does this mean for an example?
Below is an example of implementing this in code using sklearn's LogisticRegressionCV. Some notes:
- I am using cross-validation as requested, and have set it to 3 folds.
- Take this performance with a grain of salt: there is a lot of feature engineering that should be done, and parameters such as the l1_ratios should absolutely be investigated. These values were totally arbitrary.
Produces outputs that look like:

```
Logistic Regression: 0.972027972027972 || Elasticnet: 0.9090909090909091
Logistic Regression
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        53
           1       0.98      0.98      0.98        90

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Elastic Net
              precision    recall  f1-score   support

           0       0.93      0.81      0.87        53
           1       0.90      0.97      0.93        90

    accuracy                           0.91       143
   macro avg       0.92      0.89      0.90       143
weighted avg       0.91      0.91      0.91       143
```
Code below:

```python
# Load a toy dataset
from sklearn.datasets import load_breast_cancer
# Load the LogisticRegression classifier
# Note, use CV for cross-validation as requested in the question
from sklearn.linear_model import LogisticRegressionCV
# Load some other sklearn functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Import other libraries
import pandas as pd, numpy as np

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Create your training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)

# Basic LogisticRegression algorithm
logistic_regression_classifier = LogisticRegressionCV(cv=3)

# SAGA should be considered more advanced and used over SAG. For more information, see: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions
# Note, you should probably tune this, these values are arbitrary
elastic_net_classifier = LogisticRegressionCV(cv=3, penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga')

# Train the models
logistic_regression_classifier.fit(X_train, y_train)
elastic_net_classifier.fit(X_train, y_train)

# Test the models
print("Logistic Regression: {} || Elasticnet: {}".format(logistic_regression_classifier.score(X_test, y_test), elastic_net_classifier.score(X_test, y_test)))

# Print out some more metrics
print("Logistic Regression")
print(classification_report(y_test, logistic_regression_classifier.predict(X_test)))
print("Elastic Net")
print(classification_report(y_test, elastic_net_classifier.predict(X_test)))
```
Alternatively, there is another method you could use, similar to how RidgeClassifierCV functions, but we would need to write a bit of a wrapper around it, as sklearn has not provided one.
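For illustration, such a wrapper might look like the sketch below. It mimics RidgeClassifierCV's approach of regressing on {-1, +1} targets with a penalized regressor (here ElasticNetCV) and thresholding at zero. The class name ElasticNetClassifierCV is made up, and this sketch handles the binary case only:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import ElasticNetCV

class ElasticNetClassifierCV(BaseEstimator, ClassifierMixin):
    """Hypothetical sketch: a binary classifier built on ElasticNetCV,
    following the RidgeClassifierCV pattern of regressing on {-1, +1}."""

    def __init__(self, l1_ratio=0.5, cv=3):
        self.l1_ratio = l1_ratio
        self.cv = cv

    def fit(self, X, y):
        # Map the two class labels onto regression targets -1 and +1
        self.classes_ = np.unique(y)
        target = np.where(y == self.classes_[1], 1.0, -1.0)
        self.model_ = ElasticNetCV(l1_ratio=self.l1_ratio, cv=self.cv)
        self.model_.fit(X, target)
        return self

    def predict(self, X):
        # Positive regression scores map to the second class, negative to the first
        scores = self.model_.predict(X)
        return np.where(scores > 0, self.classes_[1], self.classes_[0])
```

This is a sketch, not production code: it skips multiclass handling, input validation, and probability estimates, all of which LogisticRegressionCV gives you for free.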