This is something of a follow-up to my previous question about evaluating my scikit-learn Gaussian process regressor. I am very new to GPRs, and I think I may be making a methodological mistake in how I use training versus testing data.
Essentially, I'm wondering what the difference is between fitting on only the training portion of a train/test split, like this:
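import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

# Reshape the inputs and targets into column vectors
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

# Hold out a third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)

# Fit on the training portion only
gp.fit(X_train, y_train)

# R^2 on the held-out test portion
score = gp.score(X_test, y_test)
print(score)

# Predict the mean and standard deviation on a dense grid
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)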
versus using the full data set to train, like this:
# Reshape the inputs and targets into column vectors
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)

# Fit on the full data set
gp.fit(X, Y)

# R^2 on the same data that was used for fitting
score = gp.score(X, Y)
print(score)

# Predict the mean and standard deviation on a dense grid
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
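
To make the comparison concrete, here is a minimal, self-contained sketch that runs both setups side by side. The sine-plus-noise data, the `make_gp` helper, and the reduced `n_restarts_optimizer` are assumptions for illustration only; they are not my real data or settings. As far as I understand, the first score is measured on points the model never saw during fitting, while the second is measured on the very points it was fitted to.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for some_data / other_data (assumption: noisy sine curve).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Column vectors, matching the shapes used in the snippets above.
X = x.reshape(-1, 1)
Y = y.reshape(-1, 1)


def make_gp():
    """Hypothetical helper: build a fresh, unfitted regressor with the same kernel."""
    kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
    # Fewer optimizer restarts than in my real code, just to keep the sketch fast.
    return GaussianProcessRegressor(
        kernel=kernel, n_restarts_optimizer=2, random_state=0)


# Option 1: fit on the training portion, score on the held-out portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=0)
gp_split = make_gp().fit(X_train, y_train)
held_out_score = gp_split.score(X_test, y_test)

# Option 2: fit on everything, score on the same data.
gp_full = make_gp().fit(X, Y)
in_sample_score = gp_full.score(X, Y)

print("R^2 on held-out test data:", held_out_score)
print("R^2 on the training data itself:", in_sample_score)
```

Both numbers are R^2 values from `score`, so they are comparable in scale, but only the first one says anything about points outside the fitting set.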
Is one of these options going to result in incorrect predictions?