I'm trying to use scikit-learn to predict a financial benefit estimate for clients, based on answers they give us and on our historical client projects.
My dataset looks like this:
# Data (1-15 of 470)
array([[ 8662824,  34],
       [ 7978337,  25],
       [  902219,  28],
       [29890885,  64],
       [14357494,  60],
       [ 6403602,  43],
       [96538844, 372],
       [ 7675132,  67],
       [34807493,  78],
       [46215428,  75],
       [ 5437889,  20],
       [16674835,  50],
       [17382472,  20],
       [ 5437889,  20],
       [  313111,   0]])
# Targets (1-15 of 470)
array([2739267, 20539, 18304, 16052, 25391, 19444, 61550,
       94392, 75934, 52997, 67485, 92263, 37672, 6748523,
       20710])
The data and the targets each have 470 rows in the actual dataset; only the first 15 are shown above.
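For reference, here's roughly how the arrays are set up before the split (a sketch using only a few of the placeholder rows shown above; the real data and targets come from our historical projects):

import numpy as np

# Two numeric inputs per project (only a few placeholder rows shown here)
data = np.array([[8662824, 34],
                 [7978337, 25],
                 [902219, 28]])

# One financial benefit figure per project (placeholder values)
targets = np.array([2739267, 20539, 18304])

print(data.shape)     # (470, 2) with the full dataset
print(targets.shape)  # (470,) with the full dataset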
I'm using:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(
    data,
    targets,
    test_size=.25,
    random_state=42
)

model = LogisticRegression(max_iter=5000)  # 5000 until I learn how to scale (see the scaling sketch after this block)
model.fit(x_train, y_train)

# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])
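I haven't actually added scaling yet; as far as I understand it would look something like the sketch below, assuming StandardScaler in a pipeline (this is not what produced the shell output that follows):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize both columns to zero mean / unit variance before fitting,
# which should let the solver converge without a huge max_iter
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scaled_model.fit(x_train, y_train)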
Here's some actual shell output (note the score, too):
In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])
In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])
In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])
In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])
In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])
In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364
In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)
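In case it helps, this is how I'd put the held-out predictions next to the actual targets (a sketch, reusing x_test and y_test from the split above):

preds = model.predict(x_test)

# Predicted vs. actual benefit for the first few held-out rows
for predicted, actual in zip(preds[:10], y_test[:10]):
    print(f"predicted={predicted}  actual={actual}")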
Here's some metadata from the model (via .__dict__):
{'penalty': 'l2',
'dual': False,
'tol': 0.0001,
'C': 1.0,
'fit_intercept': True,
'intercept_scaling': 1,
'class_weight': None,
'random_state': None,
'solver': 'lbfgs',
'max_iter': 5000,
'multi_class': 'auto',
'verbose': 0,
'warm_start': False,
'n_jobs': None,
'l1_ratio': None,
'n_features_in_': 2,
...
There's definitely more of a relationship between the two input values and the target than a score of 0.0093 would suggest; after all, we currently make these estimates by eye from the same data. What am I doing wrong, or under what circumstances would it be normal for a trained model to always return the same answer?