
I recently found gpytorch (https://github.com/cornellius-gp/gpytorch). It seems to be a great package for integrating GPR into pytorch, and my first tests were also positive. Using gpytorch, the GPU power as well as intelligent algorithms can be used to improve performance in comparison to other packages such as scikit-learn.
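
Just to illustrate the GPU point, here is a minimal sketch of what I mean (assuming a CUDA device is available and that the model and likelihood are built as in the code further below):

if torch.cuda.is_available():
    # Move the training data and the GP model/likelihood to the GPU
    train_x, train_y = train_x.cuda(), train_y.cuda()
    model, likelihood = model.cuda(), likelihood.cuda()
    # Training and prediction then run on the GPU; move results back
    # with .cpu() before converting to numpy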

However, I found that it is much harder to estimate the hyperparameters that are needed. In scikit-learn, that happens in the background and is very robust. I would like to get some feedback from the community about the reasons, and to discuss whether there might be a better way to estimate these parameters than the one provided by the example in the documentation of gpytorch.
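
By "in the background" I mean, for instance, that GaussianProcessRegressor can restart its internal L-BFGS optimizer from several random initializations on its own. A minimal sketch of that (the n_restarts_optimizer argument of GaussianProcessRegressor is the relevant part):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

kernel = 1.0 * Matern(length_scale=0.1, nu=2.5) + WhiteKernel()
# The first fit starts from the kernel's initial parameters; the remaining
# fits start from randomly sampled hyperparameters, and the best fit is kept
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0, n_restarts_optimizer=10)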

For comparison, I took the code of an example provided on the official page of gpytorch (https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_GP_Regression/Multitask_GP_Regression.ipynb) and modified it in two parts:

  1. I use a different kernel (gpytorch.kernels.MaternKernel(nu=2.5) instead of gpytorch.kernels.RBFKernel())
  2. I use a different output function

In the following, I first provide the code using gpytorch. Subsequently, I provide the code for scikit-learn. Finally, I compare the results.

Importing (for gpytorch and scikit-learn):

import math
import torch
import numpy as np
import gpytorch

Generating data (for gpytorch and scikit-learn):

n = 20
train_x = torch.zeros(pow(n, 2), 2)
for i in range(n):
    for j in range(n):
        # Each coordinate varies from 0 to 1 in n steps
        train_x[i * n + j][0] = float(i) / (n-1)
        train_x[i * n + j][1] = float(j) / (n-1)

train_y_1 = (torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01))/4
train_y_2 = torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01)

train_y = torch.stack([train_y_1, train_y_2], -1)

test_x = torch.rand((n, len(train_x.shape)))
test_y_1 = (torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01))/4
test_y_2 = torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01)
test_y = torch.stack([test_y_1, test_y_2], -1)

Now comes the estimation as described in the provided example from the cited documentation:

torch.manual_seed(2) # For a more robust comparison
class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.MaternKernel(nu=2.5), num_tasks=2, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)


likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = MultitaskGPModel(train_x, train_y, likelihood)

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

n_iter = 50
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    # print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
    optimizer.step()

# Set into eval mode
model.eval()
likelihood.eval()

# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

test_results_gpytorch = np.median((test_y - mean) / test_y, axis=0)

In the following, I provide the code for scikit-learn, which is a little bit more convenient^^:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, Matern
kernel = 1.0 * Matern(length_scale=0.1, length_scale_bounds=(1e-5, 1e5), nu=2.5) \
         + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(train_x.numpy(),
                                                            train_y.numpy())
# x_interpolation = test_x.detach().numpy()[np.newaxis, :].transpose()
y_mean_interpol, y_std_norm = gp.predict(test_x.numpy(), return_std=True)

test_results_scikitlearn = np.median((test_y.numpy() - y_mean_interpol) / test_y.numpy(), axis=0)

Finally, I compare the results:

comparison = (test_results_scikitlearn - test_results_gpytorch) / test_results_scikitlearn
print('Variable 1: scikit-learn is more accurate by a factor of: ' + str(abs(comparison[0])))
print('Variable 2: scikit-learn is more accurate by a factor of: ' + str(abs(comparison[1])))

Unfortunately, I did not find an easy way to fix the seed for scikit-learn. The last time I ran the code, it returned:

Variable 1: scikit-learn is more accurate by a factor of: 11.362540360431087

Variable 2: scikit-learn is more accurate by a factor of: 29.64760087022618

In the case of gpytorch, I assume that the optimizer runs into some local optimum. But I cannot think of a more robust optimization algorithm that still uses pytorch.
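
The only idea I have in this direction is to restart the Adam optimization from several random initializations and keep the best model. A rough, purely illustrative sketch (reusing MultitaskGPModel and n_iter from above; the attribute path to the lengthscale follows the MultitaskKernel structure defined there):

best_loss = float('inf')
best_state = None
for restart in range(5):
    torch.manual_seed(restart)
    likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
    model = MultitaskGPModel(train_x, train_y, likelihood)
    # Randomize the initial lengthscale so that the restarts actually differ
    model.covar_module.data_covar_module.lengthscale = 0.1 + torch.rand(1).item()
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for i in range(n_iter):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    # Keep the hyperparameters of the restart with the lowest final loss
    if loss.item() < best_loss:
        best_loss = loss.item()
        best_state = model.state_dict()
model.load_state_dict(best_state)

But maybe this is not enough to escape the kind of local optimum I am worried about, so I would still appreciate other ideas.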

I am looking forward to suggestions!

Lazloo

    Discussions and Suggestions are both things we avoid. Not sure how this "question" fits into [ask]... lots of code, lots of effort - not really a clear problem/question I can see. – Patrick Artner Jan 27 '19 at 15:51

1 Answer


(I also answered your question on the GitHub issue you created for it here.)

Primarily, this happened because you used different models in sklearn and gpytorch. In particular, sklearn learns independent GPs in the multi-output setting by default (see, e.g., the discussion here). In GPyTorch, you used the multitask GP method introduced in Bonilla et al., 2008. Correcting for this difference yields:

test_results_gpytorch = [5.207913e-04 -8.469360e-05]

test_results_scikitlearn = [3.65288816e-04 4.79017145e-05]
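
For reference, one way to fit independent GPs in GPyTorch (a sketch, assuming a GPyTorch version that supports batched kernels via batch_shape; not necessarily the exact code used for the numbers above) is to model the two outputs as a batch of single-output GPs and wrap them in a multitask distribution:

class IndependentMultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        # One independent GP per task via a batch dimension of size 2
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([2]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5, batch_shape=torch.Size([2])),
            batch_shape=torch.Size([2]),
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        # Combine the batch of independent GPs into one multitask distribution
        return gpytorch.distributions.MultitaskMultivariateNormal.from_batch_mvn(
            gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        )

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = IndependentMultitaskGPModel(train_x, train_y, likelihood)
# Training and prediction then proceed exactly as in the question's code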