
I am a noob and I have previously tackled a linear regression problem using regularised methods. That was all pretty straightforward, but I now want to use elastic net on a classification problem.

I have run a baseline logistic regression model and the prediction scores are decent (accuracy and F1 score of ~80%). I know that some of my input features are highly correlated and I suspect that I am introducing multicollinearity, which is why I want to run an elastic net to see the impact on the coefficients and compare against the baseline.

I have done some googling and I understand I need to use SGDClassifier for a regularised logistic regression model. Is this the best way to perform this analysis, and can anyone point me in the direction of a basic example with cross-validation?

DSouthy

1 Answer


Interesting question.

Your question really should be broken up into multiple other questions, such as:

  • How can I tell if my data is collinear?
  • How to deal with collinear data in a machine learning problem?
  • How can I convert logistic regression to elasticnet for classification?

I am going to focus on the third bullet above.

Additionally, there is no sample data, or even a minimal, complete, reproducible example of code for us to work from, so I am going to make some assumptions below.

How can I use logistic regression for classification?

What's the difference between logistic regression and elasticnet?

First, let's understand what is different about logistic regression vs elasticnet. This TowardsDataScience article is fairly well written and goes into the details a little bit, and you should review it if you are unfamiliar. In short,

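Plain logistic regression does not penalize the model for its weight choices, while elastic net adds both an absolute-value (L1) and a squared (L2) penalty on the weights, with the mix between the two controlled by an l1_ratio coefficient. Note that sklearn's LogisticRegression applies an L2 penalty by default, so the baseline in the example further down is already regularized.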
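To make the role of l1_ratio concrete, here is a minimal sketch of the elastic net penalty term as sklearn's user guide writes it, r(w) = (1 - rho)/2 * ||w||_2^2 + rho * ||w||_1. The function name and the example weights are my own, purely for illustration; this is not sklearn code.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Elastic net penalty: (1 - rho)/2 * sum(w^2) + rho * sum(|w|).

    Hypothetical helper for illustration only; `l1_ratio` plays the role of rho.
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return (1.0 - l1_ratio) / 2.0 * l2_squared + l1_ratio * l1_norm


# Made-up weight vector
weights = [0.5, -2.0, 0.0]

pure_l2 = elastic_net_penalty(weights, l1_ratio=0.0)  # only the squared (ridge) term
pure_l1 = elastic_net_penalty(weights, l1_ratio=1.0)  # only the absolute-value (lasso) term
mixed = elastic_net_penalty(weights, l1_ratio=0.5)    # an even blend of the two

print(pure_l2, pure_l1, mixed)
```

With l1_ratio=0 the penalty reduces to the ridge term, and with l1_ratio=1 it reduces to the lasso term; anything in between blends them.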

What does that difference look like in code?

You can review the source code for Logistic Regression here, but in short, lines 794-796 show the alpha and beta values being set differently when the penalty type is elasticnet.
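As a rough illustration of that alpha/beta split (my own paraphrase of the logic as I understand it for the saga solver, not a copy of the sklearn source, and the exact line numbers will differ between versions), the inverse regularization strength C is divided between an L2 weight and an L1 weight according to l1_ratio. The function name below is hypothetical.

```python
def split_regularization(C, l1_ratio):
    """Split the overall strength 1/C into L2 (alpha) and L1 (beta) parts.

    Hypothetical helper mirroring how I understand the elasticnet branch;
    check the source of your installed sklearn version before relying on it.
    """
    alpha = (1.0 / C) * (1.0 - l1_ratio)  # weight on the squared (L2) penalty
    beta = (1.0 / C) * l1_ratio           # weight on the absolute-value (L1) penalty
    return alpha, beta


# Example: C=0.5 gives total strength 2.0, of which a quarter goes to L1
alpha, beta = split_regularization(C=0.5, l1_ratio=0.25)
print(alpha, beta)
```

So a smaller C means stronger regularization overall, and l1_ratio only decides how that strength is shared between the two penalties.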

What does this mean for an example?

Below is an example of implementing this in code using sklearn's Logistic Regression. Some notes:

  • I am using cross validation as requested, and have set it to 3 folds
  • I would take this performance with a grain of salt -- there is a lot of feature engineering which should be done, and parameters such as the l1_ratios should absolutely be investigated. These values were totally arbitrary.

It produces output that looks like:

Logistic Regression: 0.972027972027972 || Elasticnet: 0.9090909090909091

Logistic Regression
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        53
           1       0.98      0.98      0.98        90

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Elastic Net
              precision    recall  f1-score   support

           0       0.93      0.81      0.87        53
           1       0.90      0.97      0.93        90

    accuracy                           0.91       143
   macro avg       0.92      0.89      0.90       143
weighted avg       0.91      0.91      0.91       143

Code below:

# Load libraries

# Load a toy dataset
from sklearn.datasets import load_breast_cancer

# Load the LogisticRegression classifier
# Note, use CV for cross-validation as requested in the question
from sklearn.linear_model import LogisticRegressionCV

# Load some other sklearn functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Import other libraries
import pandas as pd, numpy as np

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Create your training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)

# Basic LogisticRegression algorithm
logistic_regression_classifier = LogisticRegressionCV(cv=3)
# SAGA should be considered more advanced and used over SAG. For more information, see: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions
# Note, you should probably tune this, these values are arbitrary
elastic_net_classifier = LogisticRegressionCV(cv=3, penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga')

# Train the models
logistic_regression_classifier.fit(X_train, y_train)
elastic_net_classifier.fit(X_train, y_train)

# Test the models
print("Logistic Regression: {} || Elasticnet: {}".format(logistic_regression_classifier.score(X_test, y_test), elastic_net_classifier.score(X_test, y_test)))

# Print out some more metrics
print("Logistic Regression")
print(classification_report(y_test, logistic_regression_classifier.predict(X_test)))
print("Elastic Net")
print(classification_report(y_test, elastic_net_classifier.predict(X_test)))
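Since the question was about the impact on the coefficients, here is a hedged follow-up sketch that compares them between the two models. Two things in it are my own additions rather than part of the example above: the features are standardized with StandardScaler (saga is sensitive to feature scale, and coefficients are easier to compare on a common scale), and max_iter is raised to give the solvers room to converge. Scores from this variant will not match the numbers printed earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same toy dataset and split as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2)

# Assumption: scale features first; max_iter=5000 is an arbitrary, generous cap
baseline = make_pipeline(StandardScaler(), LogisticRegressionCV(cv=3, max_iter=5000))
elastic_net = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        cv=3,
        penalty="elasticnet",
        l1_ratios=[0.1, 0.5, 0.9],  # still arbitrary, tune for your data
        solver="saga",
        max_iter=5000,
    ),
)

baseline.fit(X_train, y_train)
elastic_net.fit(X_train, y_train)

# Pull the fitted coefficient vectors out of the final pipeline step
baseline_coef = baseline[-1].coef_.ravel()
elastic_net_coef = elastic_net[-1].coef_.ravel()

# Hyperparameters picked by cross-validation for the elastic net model
chosen_C = float(np.ravel(elastic_net[-1].C_)[0])
chosen_l1_ratio = float(np.ravel(elastic_net[-1].l1_ratio_)[0])

baseline_score = baseline.score(X_test, y_test)
elastic_net_score = elastic_net.score(X_test, y_test)

print("Chosen C: {} || chosen l1_ratio: {}".format(chosen_C, chosen_l1_ratio))
print("Coefficients set exactly to zero by elastic net: {}".format(int(np.sum(elastic_net_coef == 0))))
print("Largest absolute coefficient -> baseline: {:.3f} || elastic net: {:.3f}".format(
    np.abs(baseline_coef).max(), np.abs(elastic_net_coef).max()))
print("Test accuracy -> baseline: {:.3f} || elastic net: {:.3f}".format(baseline_score, elastic_net_score))
```

If the L1 part of the penalty is doing anything useful for correlated features, you would expect some of the elastic net coefficients to shrink to exactly zero, though how many depends on the C and l1_ratio that cross-validation selects.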

Alternatively, there is another approach you could use, similar to how RidgeClassifierCV works, but sklearn does not provide an elastic net equivalent, so we would need to write a small wrapper ourselves.

artemis