I'm trying to use GaussianProcessRegressor in sklearn to predict values at unseen points.
The target values are typically between 1000 and 10000.
Since they do not follow a zero-mean prior, I set up the model with normalize_y = False, which is the default.

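from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# The kernel must be an instance (RBF()), not the class itself.
gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0, alpha=1e-10, normalize_y=False)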

When I predict at unseen points with the gpr model, the returned std values are unrealistically small, on the order of 0.1, which is about 0.001% of the predicted values.
When I change the setting to normalize_y = True, the returned std values are more realistic, around 500.

Can someone explain exactly what normalize_y does here, and whether I should set it to True or False in this case?

tlentali
pjb

1 Answer

I found the closest answer HERE: https://github.com/scikit-learn/scikit-learn/issues/15612

"OK I think I know what might be going on here. It's a bit tricky to see but I think that none of the kernels have a vertical length scale parameter, so kernel(x,x) is always equal to 1. All the diagonal elements of K are equal to 1 (before we add the ridge to it), for example.

We can then see that the variance of the predictions can only be between 0 and 1. For example, if we're predicting at a point far from the training data (so kernel(X, x_new) is a vector of zeros) then the variance is just

sigma^2 = kernel(x_new, x_new) = 1

I think the real problem here is that the prior is for data with unit variance, but the data doesn't have unit variance. The solution would be to normalise the data so that it has unit variance after it 'enters' the GP, conduct the GP analysis, and then 'unnormalise' it back again at the end. The code already removes the mean automatically, so I think we just need to divide by the standard deviation at the same point and it would work OK.

So could just need a few extra lines!"
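The point made in the quoted comment can be checked with a small sketch. The training data, the fixed length scale, and `optimizer=None` (used here only so the result is deterministic) are my own assumptions for illustration, not part of the linked issue:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up training data: 8 evenly spaced inputs, targets in the thousands.
X = np.linspace(0.0, 10.0, 8).reshape(-1, 1)
y = 5000.0 + 2000.0 * np.sin(X[:, 0])

# A plain RBF kernel has no amplitude parameter, so kernel(x, x) == 1.
kernel = RBF(length_scale=1.0)
prior_var = kernel.diag(X)

# optimizer=None keeps the length scale fixed so the numbers are reproducible.
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-10,
                               optimizer=None, normalize_y=False)
gpr.fit(X, y)

# Far from the training data the posterior falls back to the prior,
# whose standard deviation is 1 regardless of the scale of y.
X_far = np.array([[1000.0]])
_, std_far = gpr.predict(X_far, return_std=True)

# Everywhere else the predicted std lies between 0 and 1.
X_grid = np.linspace(-5.0, 15.0, 200).reshape(-1, 1)
_, std_grid = gpr.predict(X_grid, return_std=True)

print(prior_var)        # diagonal of the prior covariance
print(std_far)          # std far away from the data
print(std_grid.max())   # largest std on the grid
```

With targets in the thousands, a standard deviation capped at 1 is what makes the uncertainty look unrealistically small.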

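For this reason, setting normalize_y = True should fix this issue: in recent scikit-learn versions it both centers y and scales it to unit variance before fitting, then converts the predictions back. Alternatively, you can give the kernel an explicit amplitude by multiplying it by a ConstantKernel. Note that changing the length_scale_bounds parameter alone only affects the horizontal length scale and does not change the prior variance.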
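As a rough sketch of how the predicted std can be brought onto the scale of the data, the snippet below compares `normalize_y=True` with an explicit amplitude from `ConstantKernel`. The data and the fixed hyperparameters (`optimizer=None`) are assumptions chosen for reproducibility, and the `normalize_y=True` result assumes a scikit-learn version in which that option also rescales y to unit variance:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Made-up training data with targets in the thousands.
X = np.linspace(0.0, 10.0, 8).reshape(-1, 1)
y = 5000.0 + 2000.0 * np.sin(X[:, 0])
X_far = np.array([[1000.0]])  # a point far from all training inputs

# Option 1: let the regressor standardise y internally and undo it on output.
gpr_norm = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-10,
                                    optimizer=None, normalize_y=True)
gpr_norm.fit(X, y)
mean_norm, std_norm = gpr_norm.predict(X_far, return_std=True)

# Option 2: keep y as is, but give the kernel an explicit amplitude.
# Here the amplitude is simply set to the sample variance of y; normally
# it would be left for the optimizer to tune.
amp_kernel = ConstantKernel(constant_value=y.var()) * RBF(length_scale=1.0)
gpr_amp = GaussianProcessRegressor(kernel=amp_kernel, alpha=1e-10,
                                   optimizer=None, normalize_y=False)
gpr_amp.fit(X, y)
mean_amp, std_amp = gpr_amp.predict(X_far, return_std=True)

print(y.std(), std_norm, std_amp)  # both stds are on the scale of y
print(mean_norm, mean_amp)         # the far-away means differ, see note below
```

One difference to keep in mind: with `normalize_y=True` the prediction far from the data reverts to the training mean of y, whereas with `normalize_y=False` it reverts to 0 because the prior is still zero-mean, even when the kernel has an amplitude.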

I hope this helps those who land here as I faced the same issue!

  • Sorry, but I'm still a bit confused. Do I need to set this to `True` when my `y` is in `[0,1]`, with mean and variance close to 0? And how do I change length_scale_bounds properly? – G_cy Feb 28 '23 at 18:16