
I'm trying to use scikit-learn to predict a financial benefit estimate for clients, using answers they give us together with data from our historical client projects.

My dataset looks like this:

 # Data (1-15 of 470)
 array(
    [[8662824,       34],
    [ 7978337,       25],
    [  902219,       28],
    [29890885,       64],
    [14357494,       60],
    [ 6403602,       43],
    [96538844,      372],
    [ 7675132,       67],
    [34807493,       78],
    [46215428,       75],
    [ 5437889,       20],
    [16674835,       50],
    [17382472,       20],
    [ 5437889,       20],
    [  313111,        0]])

 # Targets (1-15 of 470)
 array([2739267,   20539,   18304,   16052,   25391,   19444,   61550,
      94392,   75934,   52997,   67485,   92263,   37672, 6748523,
      20710])

The actual data and targets each have 470 rows.

I'm using:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data,
    targets,
    test_size=.25,
    random_state=42
)
model = LogisticRegression(max_iter=5000)  # 5000 until I learn how to scale
model.fit(x_train, y_train)

# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])

Here's some actual shell output (note the score, too):

In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])

In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])

In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])

In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])

In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])

In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364

In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)

Here's some metadata from the model (via .__dict__):

{'penalty': 'l2',
 'dual': False,
 'tol': 0.0001,
 'C': 1.0,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'class_weight': None,
 'random_state': None,
 'solver': 'lbfgs',
 'max_iter': 5000,
 'multi_class': 'auto',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None,
 'l1_ratio': None,
 'n_features_in_': 2,
 ...

There's definitely a stronger relationship between the two features and the target than a score of 0.0093 would indicate; after all, we currently use the same data to make these estimates in our heads. Do you know what I'm doing wrong, or under what circumstances it would be normal for a trained model to always return the same answer?

orokusaki

2 Answers


LogisticRegression is a classifier: it predicts a discrete, multi-class target. With a continuous target it treats every distinct value as its own class, which is why it can collapse to predicting the same value every time.

Since your target is a continuous variable, you should use LinearRegression instead:

from sklearn.linear_model import LinearRegression

model = LinearRegression() 

More info in this post.
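
For reference, here's a minimal sketch of the full swap, assuming the data and targets arrays from the question:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)

# Predictions now vary with the inputs instead of collapsing to one class
print(model.predict([[16000000, 5]]))
print(model.predict([[150000, 20]]))

# For a regressor, .score() returns R^2, not classification accuracy
print(model.score(x_test, y_test))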

Mattravel

Your target value is a continuous variable, so you need to use a regression model. For a simple regression model, you can use linear regression or a decision tree; if you want a more complex model, you can use a random forest or gradient boosting. If you use the linear regression model, don't forget to scale your features with a standard scaler or a robust scaler.
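
As a minimal sketch of the scaling advice, using a scikit-learn Pipeline and assuming the train/test split from the question:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler puts both features on a comparable scale; the first
# feature here is several orders of magnitude larger than the second
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))  # R^2 on the held-out split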

Pierre-Loic
  • Thank you so much. I now understand the difference between logistic and linear regression. How does a random forest differ from linear regression, in terms of general high-level utility? – orokusaki Feb 24 '23 at 13:01
  • Linear regression is a simple model with a small number of hyperparameters that you can use as a baseline to compare metrics against more complex models (you can also use Ridge, Lasso, or ElasticNet, which are linear regression with different kinds of regularisation). On the other hand, a random forest is an ensemble model with a large number of hyperparameters to tune (for instance, the number of decision trees in the random forest or the size of those trees). – Pierre-Loic Feb 24 '23 at 16:38
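
To illustrate the baseline-versus-ensemble comparison from the comment above, a hedged sketch, again assuming the data and targets arrays from the question:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Compare the simple baseline against an ensemble with 5-fold
# cross-validation; the default scoring for regressors is R^2
for name, estimator in [
    ("linear regression", LinearRegression()),
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]:
    scores = cross_val_score(estimator, data, targets, cv=5)
    print(name, scores.mean())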