
I have a variable, and I need to predict its value as closely as possible, but never greater than it. For example, given y_true = 9000, I want y_pred to be any value in the range [0, 9000], as close to 9000 as possible. Likewise, if y_true = 8000, y_pred should lie in [0, 8000]. That is, I want to place a kind of restriction on the predicted value, and that threshold is individual for each pair of prediction and target variable in the sample: if y_true = [8750,9200,8900,7600], then y_pred should be [<=8750,<=9200,<=8900,<=7600]. The only task is to predict no more than the target while getting as close to it as possible. Zero is always a "correct" answer, but I need to get as close as possible.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

data, target = np.array(data), np.array(df_tar)  # data and df_tar come from earlier preprocessing
X_train, X_test, y_train, y_test = train_test_split(data, target)
gbr = GradientBoostingRegressor(max_depth=1, n_estimators=100)
%time gbr.fit(X_train, np.ravel(y_train))
print(gbr.score(X_test, y_test), gbr.score(X_train, y_train))
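
To make the constraint concrete, here is a minimal sketch of the per-pair check being asked for (the sample predictions below are invented for illustration, not real model output):

import numpy as np

y_true = np.array([8750, 9200, 8900, 7600])
y_pred = np.array([8700.0, 9200.0, 8850.5, 7599.0])  # invented predictions

# The constraint is per pair: each prediction must stay at or below its own target
is_valid = np.all(y_pred <= y_true)
print(is_valid)  # True for these sample values: no prediction overshoots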
  • This question, related to converting a sequence to a predefined range, might help: https://stackoverflow.com/questions/929103/convert-a-number-range-to-another-range-maintaining-ratio – blacksite Jul 08 '20 at 12:58
  • Please try to express the full logic in your posts without requiring extra edits, as it will force users to modify their answers too. – Celius Stingher Jul 08 '20 at 13:34

1 Answer


Given the complexity of modifying sklearn to build a model that enforces this constraint inside its objective, I strongly suggest you apply the filter after the prediction instead: replace every predicted value that exceeds its threshold with the threshold itself. Afterwards, manually compute the score, which I believe is MSE in this scenario.

Here is a full working example of my approach:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as mse
import numpy as np

X = [[8500,9500],[9200,8700],[8500,8250],[5850,8800]]
y = [8750,9200,8900,7600]
data, target = np.array(X),np.array(y)
gbr = GradientBoostingRegressor(max_depth=1,n_estimators=100)
gbr.fit(data,np.ravel(target))
predictions = gbr.predict(data)
print(predictions)  # the original predictions

Output:

[8750.14958301 9199.23464805 8899.87846735 7600.73730159]

Perform the replacement:

# Keep each prediction unless it exceeds its target, in which case fall back to the target
fixed_predictions = np.array([z if y > z else y for y, z in zip(target, predictions)])
print(fixed_predictions)

Output:

[8750.         9199.23464805 8899.87846735 7600.        ]

Compute the new score:

score = mse(target, fixed_predictions)  # score the clipped predictions
print(score)

Output:

0.150133...
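
As a side note, the list comprehension above is equivalent to an element-wise minimum, so the whole post-processing step can be wrapped in one short helper (a sketch; the function name `clip_to_target` is made up):

import numpy as np
from sklearn.metrics import mean_squared_error as mse

def clip_to_target(y_true, y_pred):
    # Element-wise minimum: keep each prediction unless it exceeds its target
    return np.minimum(np.asarray(y_pred), np.asarray(y_true))

target = np.array([8750, 9200, 8900, 7600])
predictions = np.array([8750.14958301, 9199.23464805, 8899.87846735, 7600.73730159])

fixed = clip_to_target(target, predictions)
print(fixed)               # same result as the list-comprehension replacement above
print(mse(target, fixed))  # same score as above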
Celius Stingher
  • How about subtracting the [maxY-9000] difference from all values? Wouldn't that introduce less noise? (I only have one college year of experience with ML) – Benoit F Jul 08 '20 at 13:24
  • You can certainly scale it to reduce noise, it's a good idea, but that won't prevent the algorithm (given the proper data) from making a prediction above the threshold the user desires. (For instance a poly fit, after scaling, might still overfit and predict a huge value for an x slightly above the observed x's) – Celius Stingher Jul 08 '20 at 13:25
  • Let's say the highest Y score predicted is 9500; if I subtract 500 from all the predictions, they will all be below 9000, so it respects the threshold – Benoit F Jul 08 '20 at 13:28
  • Ah, sorry, I misunderstood your point! It seems to be the same as my approach; I'm just forcing them to 9000 instead of subtracting the difference, but in essence it will turn all 9000+ predicted values to 9000, right? – Celius Stingher Jul 08 '20 at 13:32
  • The only difference is that it will keep the "order": for example, 9400 will become 8900 while 9500 becomes 9000 (see the sketch after this thread). But I'm not sure whether that improves the model; that's why I asked :) – Benoit F Jul 08 '20 at 13:34
  • Yes, sounds good; you would keep the difference as a ratio to be subtracted as a bias in the model. Off the top of my head I can't be certain it'll improve the performance of the model, but it's definitely worth testing! However, the user has updated this post with a different logic, it seems. – Celius Stingher Jul 08 '20 at 13:43
  • Guys, that threshold is individual for each pair of prediction and target variable in the sample: if y_true = [8750,9200,8900,7600], then y_pred should be [<=8750,<=9200,<=8900,<=7600] – Blackjack Jesus and others Jul 08 '20 at 13:44
  • Okay, thanks for the opinion, Celius. @BlackjackJesusandothers then Celius's answer will fit what you want – Benoit F Jul 08 '20 at 13:47
  • But how will I use it in practice? When I predict, I don't know what value I should get – Blackjack Jesus and others Jul 08 '20 at 13:57
  • The only task is to predict exactly no more than the target and get closer. Zero is always considered a correct answer, but I just need to get as close as possible – Blackjack Jesus and others Jul 08 '20 at 14:00
  • This answer helps you achieve the expected output for known values; if you are trying to predict unknown values, then you can't use this, because you wouldn't know `y_true`. What you can do in any case is fit on the `fixed_predictions` or add them as an extra feature for the X values. – Celius Stingher Jul 08 '20 at 14:06
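
For completeness, here is a sketch of the two post-processing strategies discussed in this thread: the answer's per-pair clipping versus Benoit F's idea of shifting every prediction down by the single largest overshoot, which preserves the ordering of the predictions (both helper names are made up, and the shift variant is adapted here from the fixed 9000 threshold to the per-pair targets from the question's edit):

import numpy as np

def clip_each(y_true, y_pred):
    # The answer's approach: cap each prediction at its own target
    return np.minimum(y_pred, y_true)

def shift_by_max_overshoot(y_true, y_pred):
    # Benoit F's idea: subtract the largest overshoot from every prediction,
    # so all values land at or below their targets while keeping their order
    overshoot = np.max(y_pred - y_true)
    return y_pred - max(overshoot, 0.0)

y_true = np.array([8750.0, 9200.0, 8900.0, 7600.0])
y_pred = np.array([8750.15, 9199.23, 8899.88, 7600.74])  # invented predictions

print(clip_each(y_true, y_pred))               # only the overshooting values change
print(shift_by_max_overshoot(y_true, y_pred))  # every value drops by 0.74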