
I wonder if there is a way to constrain the predictions to a range before fitting the model.

The variable in question in my train data is technically a percentage score, but when I predict on my test set, I get negative values or values > 100.

For now, I am manually normalizing the predictions list. I also used to cut off negatives and values > 100 and assign them 0 and 100, respectively.
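For reference, a minimal sketch of that cutoff step with NumPy (the preds values here are made up for illustration, not from my data):

import numpy as np

# Hypothetical raw predictions from an already-fitted regressor
preds = np.array([-3.2, 41.7, 99.1, 108.4])

# Clamp everything to the valid percentage range [0, 100]
clipped = np.clip(preds, 0, 100)
# array([  0. ,  41.7,  99.1, 100. ])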

However, it would really only make sense if the fit function itself could be made aware of this constraint, right?

Here is a sample row of the data:

test_df = pd.DataFrame([[0, 40, 28, 30, 40, 22, 60, 40, 21, 0, 85, 29, 180, 85, 36, 741, 25.0]], columns=['theta_1', 'phi_1', 'value_1', 'theta_2', 'phi_2', 'value_2', 'theta_3', 'phi_3', 'value_3', 'theta_4', 'phi_4', 'value_4', 'theta_5', 'phi_5', 'value_5', 'sum_readings', 'estimated_volume'])

From my reading, a lot of people consider this not to be a linear regression problem, but their logic is not sound. Others say one can apply a log scale, but that only works when comparing against a threshold, i.e., manual classification, i.e., using linear regression for a logistic regression problem! In my case, I need the percentages themselves as the required output.

Your feedback/thoughts are much appreciated.

1 Answer


Some algorithms will not predict values outside the range of the training targets, for example sklearn.neighbors.KNeighborsRegressor or sklearn.ensemble.RandomForestRegressor.

LinearRegression, on the other hand, can produce values outside the target range. Here is an example:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Target y lives in [0, 1]; the single feature is simply 2 * y
y = np.linspace(0, 1, 100)
X = (2 * y).reshape(-1, 1)

rf = RandomForestRegressor()
lr = LinearRegression()
rf.fit(X, y)
lr.fit(X, y)

# X = 4 is far outside the training feature range [0, 2]
>>> rf.predict(np.array([[4.]])), lr.predict(np.array([[4.]]))
# (array([0.9979798]), array([2.]))

But you can use a trick: map your [0, 1] space to the (-inf, inf) space, and come back to the initial space after prediction.

Here is an example using the sigmoid and its inverse:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_m1(x):
    # inverse of the sigmoid (the logit)
    return -np.log((1 / x) - 1)

rf = RandomForestRegressor()
lr = LinearRegression()
# squeeze y from [0, 1] into [0.05, 0.95] so the logit stays finite at the endpoints
rf.fit(X, sigmoid_m1(y * 0.9 + 0.05))
lr.fit(X, sigmoid_m1(y * 0.9 + 0.05))

>>> sigmoid(rf.predict(np.array([[4.]]))), sigmoid(lr.predict(np.array([[4.]])))
# (array([0.9457559]), array([0.99904361]))

Take care when using this kind of solution, because it completely changes the distribution of the targets, which can create a lot of problems.
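If you do go this route with percentage targets, here is a sketch of the same trick scaled to [0, 100] (X and y_pct are placeholder data, not the question's actual columns):

import numpy as np
from sklearn.linear_model import LinearRegression

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_m1(x):
    return -np.log((1 / x) - 1)

# Placeholder features and percentage targets in [0, 100]
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y_pct = rng.uniform(0, 100, 100)

# Rescale to (0, 1), squeezing away from 0 and 1 so the logit stays finite
y_unit = (y_pct / 100) * 0.98 + 0.01

lr = LinearRegression().fit(X, sigmoid_m1(y_unit))

# The sigmoid maps any real-valued prediction into (0, 1), so after
# rescaling the final values are guaranteed to lie within (0, 100)
preds_pct = sigmoid(lr.predict(X)) * 100

The exact squeeze (0.98 and 0.01 here) is arbitrary; the important part is keeping the transformed targets strictly inside (0, 1) so the logit never blows up.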

  • Thank you, Guissart. Testing other models was in the backlog, but thanks for reassuring me that I was on the correct path. Regarding output mapping, the reason you mentioned is exactly why I was hesitant to map the output upward, so I decided to simply normalize it instead. However, I believe both have the same effect on the distribution. – Arman Didandeh Jul 13 '18 at 13:27