I have a training set with one feature (credit balance), with values ranging from 0 to 20,000. The response is binary: 0 (Default = No) or 1 (Default = Yes). The training set was simulated using a logistic function. For reference it is available here.
The following boxplot shows the distribution of balance for the default=yes and default=no classes respectively:

The following is the distribution of the data:
The dataset is also perfectly balanced, with 50% of the data in each response class, so it looks like a classic case for logistic regression. However, the fitted model scores only 0.5 because it predicts y=1 for every observation. Logistic regression is applied as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Fit on the single unscaled feature and report mean accuracy on the training set
clf = LogisticRegression().fit(df[['Balance']], df['Default'])
clf.score(df[['Balance']], df['Default'])
This suggests that something is off with the way logistic regression fits this data. When the balance feature is scaled, however, the score improves to 87.5%. So does scaling play a factor here?
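For context, here is a minimal, self-contained sketch of the unscaled-vs-scaled comparison. The data-generating parameters (a logistic curve centered at a balance of 10,000) are assumptions standing in for the actual simulated dataset, which is not reproduced in the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed stand-in for the question's data: one feature in [0, 20000]
# with a logistic relationship to the binary response.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 20_000, size=1_000)
p = 1 / (1 + np.exp(-(balance - 10_000) / 1_000))
default = rng.binomial(1, p)
X = balance.reshape(-1, 1)

# Unscaled fit: the optimizer works with a very small coefficient
# (on the order of 1/1000) against a feature spanning tens of thousands.
raw = LogisticRegression().fit(X, default)

# Scaled fit: standardizing the feature first gives a well-conditioned problem.
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, default)

print("unscaled score:", raw.score(X, default))
print("scaled score:", scaled.score(X, default))
```

Wrapping the scaler and the classifier in a `Pipeline` also ensures that the same standardization learned on the training data is applied at prediction time.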
Edit: Why does scaling play a factor here? The scikit-learn documentation for LogisticRegression says that the lbfgs solver is robust to unscaled data.
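One diagnostic worth checking is whether lbfgs actually converged on the unscaled data, since a poorly conditioned problem can exhaust the default iteration budget (`max_iter=100`) before the fit is finished. The synthetic data below is an assumed stand-in for the question's dataset:

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Assumed synthetic stand-in: one unscaled feature in [0, 20000].
rng = np.random.default_rng(1)
X = rng.uniform(0, 20_000, size=(500, 1))
p = 1 / (1 + np.exp(-(X[:, 0] - 10_000) / 1_000))
y = (rng.uniform(size=500) < p).astype(int)

# Record warnings so we can see whether lbfgs stopped early.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf = LogisticRegression(max_iter=100).fit(X, y)  # default max_iter

print("iterations used:", clf.n_iter_[0])
print("hit ConvergenceWarning:",
      any(issubclass(w.category, ConvergenceWarning) for w in caught))
```

If `n_iter_` equals `max_iter` and a `ConvergenceWarning` was raised, the 0.5 score reflects an unconverged fit rather than a failure of logistic regression itself; raising `max_iter` or scaling the feature are the two standard remedies.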