Haversine distance in sklearn.gaussian_process.kernels

Question

Is there a builtin way to pass custom distance functions to be used by the kernels you could use for Gaussian Process Models? In particular, I have geographic data in lat/lon coordinates, so using Euclidean distances would not give accurate distances between points. This seems like a not-uncommon use case for GPR, so wondering if there is a standard(-ish) way of implementing this in scikit-learn.

Right now, I've written a new Kernel sub-class where I replaced the metric argument in the pdist(metric='sqeuclidean') and cdist(metric='sqeuclidean') calls within the RBF kernel source code with arguments that you can set when instantiating the kernel, but this seems hack-y and I'm wondering if there's a better way to do this. Ultimately, it seems like you should be able to pass an arbitrary distance function to all of these kernels but I haven't been able to figure out how to do that. The class I wrote (which is almost entirely just the same as the standard kernels.RBF class) is below. Anyone see a better way to do this? Or why what I'm doing is a bad idea?

class customDistRBF(RBF):
    """Same as sklearn.gaussian_process.kernels.RBF except that
    it allows for a custom distance function in the RBF kernel.
    """

    def __init__(self, length_scale=1.0, length_scale_bounds=(1e-5, 1e5), dist_func='sqeuclidean'):
        RBF.__init__(self, length_scale=1.0, length_scale_bounds=length_scale_bounds)
        self.dist_func = dist_func

    def __call__(self, X, Y=None, eval_gradient=False):

        X = np.atleast_2d(X)
        length_scale = _check_length_scale(X, self.length_scale)
        if Y is None:
            # the below line is changed
            dists = pdist(X, metric=self.dist_func) / length_scale
            K = np.exp(-.5 * dists)
            # convert from upper-triangular matrix to square matrix
            K = squareform(K)
            np.fill_diagonal(K, 1)
        else:
            if eval_gradient:
                raise ValueError(
                    "Gradient can only be evaluated when Y is None.")
            # the below line is changed
            dists = cdist(X, Y,
                          metric=self.dist_func) / length_scale
            K = np.exp(-.5 * dists)

        if eval_gradient:
            if self.hyperparameter_length_scale.fixed:
                # Hyperparameter l kept fixed
                return K, np.empty((X.shape[0], X.shape[0], 0))
            elif not self.anisotropic or length_scale.shape[0] == 1:
                K_gradient = \
                    (K * squareform(dists))[:, :, np.newaxis]
                return K, K_gradient
            elif self.anisotropic:
                # We need to recompute the pairwise dimension-wise distances
                K_gradient = (X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2 \
                    / (length_scale ** 2)
                K_gradient *= K[..., np.newaxis]
                return K, K_gradient
        else:
            return K

Do you need [something like this](https://stackoverflow.com/a/43192411/5741205)? — MaxU - stand with Ukraine, Apr 10 '18 at 21:33
Thanks for the note @MaxU. DistanceMetric.pairwise can give the haversine distance, but what I really want to evaluate is a RBF kernel function where the distance between two points is measured by the haversine distance. There doesn't appear to be a way to use a non-euclidean distance function in the RBF kernel, which is why I made a new class. I used pdist/cdist instead of DistanceMetric because it outputs results in a different format and the standard RBF class already uses pdist. — Ian Bolliger, Apr 11 '18 at 22:40

score 0 · Answer 1 · answered Oct 23 '18 at 15:00

I would suggest converting your lat/lon coordinates to Cartesian space and then you should be able to use any of the sklearn kernels that rely on Euclidean distance calculations.

def lon_lat_to_cartesian(lon, lat, R = 6371):
    """
    Returns Cartesian coordinates of a point on a sphere with radius R = 6371 
    km for earth
    """
    import numpy as np
    lon_r = np.radians(lon)
    lat_r = np.radians(lat)
    x =  R * np.cos(lat_r) * np.cos(lon_r)
    y = R * np.cos(lat_r) * np.sin(lon_r)
    z = R * np.sin(lat_r)
    return x,y,z

Haversine distance in sklearn.gaussian_process.kernels

1 Answers1