
I have a training set with one feature (credit balance), with values ranging from 0 to 20,000. The response is either 0 (Default=No) or 1 (Default=Yes). This is a simulated training set generated using a logistic function. For reference, it is available here.

The following boxplot shows the distribution of the balance for the default=yes and default=no classes respectively -

[boxplot: balance by default class]

The following is the distribution of the data -

[plot: distribution of the balance data]

The dataset is also perfectly balanced, with 50% of the data in each response class, so it looks like a classic case for Logistic Regression. However, on applying Logistic Regression the score comes out to be 0.5 because only y=1 is being predicted. This is how Logistic Regression is being applied -

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# df is the simulated dataset with columns 'Balance' and 'Default'
clf = LogisticRegression().fit(df[['Balance']], df['Default'])
clf.score(df[['Balance']], df['Default'])  # returns 0.5; only y=1 is predicted

This suggests that something must be off with the way Logistic Regression fits this data. When the balance feature is scaled, though, the score improves to 87.5%. So does scaling play a role here?
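As a quick check, here is a minimal sketch, assuming the same DataFrame df with 'Balance' and 'Default' columns; it standardizes the feature before fitting (StandardScaler is one common way to scale, though the question does not say which scaling was used):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# standardize the balance feature, then fit the same logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(df[['Balance']], df['Default'])
pipe.score(df[['Balance']], df['Default'])  # ~0.875 on the data described above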

Edit: Why does scaling play a role here? The sklearn documentation for Logistic Regression says that the lbfgs solver is robust to unscaled data.

Anirban Chakraborty

1 Answer


Not only that: if you scale the feature by any constant, e.g. df['Balance']/2, df['Balance']/1000, or df['Balance']*2, all of these will probably give ~87% accuracy. Depending on the random state selected by default, it gives either 87% or 50%.

The underlying implementation uses a random number generator to fit the model, so it is not uncommon to get different solutions. In the case in question the classes are not linearly separable, so the solver might not find a solution, and it definitely won't always give you a good one.

You can find a solution when you change the random_state parameter, so it is probably a good idea to score the model multiple times and average the performance, as in the sketch below.
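A small sketch of that suggestion, assuming the df from the question (note that, per the sklearn docs, random_state only affects the 'sag', 'saga' and 'liblinear' solvers, so with the default lbfgs solver the scores may all be identical, as the comment below reports):

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = []
for seed in range(20):
    # refit with a different random state and record the training accuracy
    clf = LogisticRegression(random_state=seed).fit(df[['Balance']], df['Default'])
    scores.append(clf.score(df[['Balance']], df['Default']))
print(np.mean(scores), np.std(scores))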

[EDIT] Also, https://scikit-learn.org/stable/modules/linear_model.html#liblinear-differences mentions each solver's robustness to unscaled data and its speed on large datasets.
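One way to probe that robustness claim directly is to check whether lbfgs actually converges on the raw balances. This is only a diagnostic sketch, assuming the df from the question, and uses the standard max_iter parameter and n_iter_ attribute of LogisticRegression:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(df[['Balance']], df['Default'])
# if n_iter_ equals the default max_iter (100), the solver stopped before converging
print(clf.n_iter_, clf.score(df[['Balance']], df['Default']))

# give the solver far more iterations on the unscaled feature and compare
clf_long = LogisticRegression(max_iter=100_000).fit(df[['Balance']], df['Default'])
print(clf_long.n_iter_, clf_long.score(df[['Balance']], df['Default']))

If the score recovers with a larger max_iter, non-convergence on the unscaled feature (rather than the data itself) is the likely explanation for the 0.5 score.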

Harsh Sharma
  • thanks. I have two points. First, the documentation referred to in the answer says that the lbfgs solver is robust to unscaled datasets. This seems to be challenged, as scaling drastically improves the score. And this has nothing to do with random_state, as I tried 100 random integers and got the same result every time. Also, scaling the data did not change the nature of the data distribution (2nd figure in question). So why does the scaling work out here? Does it have something to do with the way maximum likelihood is implemented? – Anirban Chakraborty Jun 28 '21 at 08:37
  • Where does the 87% come from (that's what I get....)? – jtlz2 Feb 21 '22 at 18:25
  • (I have also found several papers at that accuracy and am now questioning their results!) @AnirbanChakraborty – jtlz2 Feb 22 '22 at 06:55