
This is somewhat of a follow-up to my previous question about evaluating my scikit-learn Gaussian process regressor. I am very new to GPRs and I think I may be making a methodological mistake in how I am using training vs. testing data.

Essentially, I'm wondering what the difference is between splitting the input into training and test data, like this:

import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
                                alpha=1e-10,
                                copy_X_train=True,
                                kernel=kernel,
                                n_restarts_optimizer=10,
                                normalize_y=False,
                                random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)

versus training (and scoring) on the full data set, like this:

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
                                alpha=1e-10,
                                copy_X_train=True,
                                kernel=kernel,
                                n_restarts_optimizer=10,
                                normalize_y=False,
                                random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)

Is one of these options going to result in incorrect predictions?

M-Wi

1 Answer


You split the training data off from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, just put some data in Excel and plot it with a smooth line. Technically, that spline function from Excel is a perfect model of the data, but it is useless for predicting new values.

In your example, the predictions are over a uniform grid, which lets you visualize what your model thinks the underlying function looks like, but it tells you nothing about how well the model generalizes. Sometimes you can get very high accuracy (> 95%) on training data and worse-than-chance accuracy on test data, which means the model is overfitting.

In addition to predicting over a uniform grid to visualize the model, you should also predict values for the test set and then compare accuracy metrics for the test and training data.
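
For concreteness, here is a minimal sketch of that comparison, assuming the fitted gp and the X_train/X_test/y_train/y_test arrays from the first example in the question (the metric functions are standard scikit-learn ones):

from sklearn.metrics import mean_squared_error, r2_score

# Predictions on the data the model was fit to, and on the held-out data.
y_train_pred = gp.predict(X_train)
y_test_pred = gp.predict(X_test)

# Compare the same metrics on both sets; a much higher training score
# than test score is the classic sign of overfitting.
print("train R^2:", r2_score(y_train, y_train_pred))
print("test  R^2:", r2_score(y_test, y_test_pred))
print("train MSE:", mean_squared_error(y_train, y_train_pred))
print("test  MSE:", mean_squared_error(y_test, y_test_pred))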

rwalroth
  • Could you please explain the last part a bit more? Overall, are you saying I need to stick to the former of my examples (splitting the data up for training/testing) and pull predictions from the test set in addition to the x_pred values? Thank you! – M-Wi Apr 06 '20 at 22:32
  • So you should create three arrays of predictions: one for X_test, one for X_train, and one for x_pred. Then I would look up scikit-learn's evaluation metrics for some options that might make sense, but the standard accuracy is a good place to start. Compare the predictions from X_train to the y_train data and see how well the model did. Then compare the predictions from X_test to y_test; if this accuracy is drastically lower than the score for X_train, it means your model is overfit. Finally, plot the predictions from x_pred to see the shape of the GP function. – rwalroth Apr 07 '20 at 20:13
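
To make the workflow described in the comment above concrete, here is a rough sketch (not taken from either post) that builds the three prediction arrays and plots the GP over the uniform grid. It assumes the fitted gp, the train/test split, and x_pred from the question, and uses matplotlib for the plot:

import matplotlib.pyplot as plt

# Three prediction arrays, as suggested in the comment above.
y_train_pred = gp.predict(X_train)                        # compare against y_train
y_test_pred = gp.predict(X_test)                          # compare against y_test
y_grid_pred, sigma = gp.predict(x_pred, return_std=True)  # for visualizing the GP

# Plot the learned mean function with an approximate 95% interval,
# plus the training and test points for reference.
plt.plot(x_pred, y_grid_pred, label="GP mean")
plt.fill_between(x_pred.ravel(),
                 y_grid_pred.ravel() - 1.96 * sigma,
                 y_grid_pred.ravel() + 1.96 * sigma,
                 alpha=0.3, label="95% interval")
plt.scatter(X_train, y_train, s=10, label="train")
plt.scatter(X_test, y_test, s=10, label="test")
plt.legend()
plt.show()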