Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python

Question

Using the following code to cluster geolocation coordinates results in 3 clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten

    coordinates= np.array([
               [lat, long],
               [lat, long],
                ...
               [lat, long]
               ])
    x, y = kmeans2(whiten(coordinates), 3, iter = 20)  
    plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
    plt.show()

Is it right to use Kmeans for location clustering, as it uses Euclidean distance and not Haversine formula as a distance function?

score 24 · Accepted Answer · answered Sep 22 '16 at 16:05

k-means is not a good algorithm to use for spatial clustering, for the reasons you meantioned. Instead, you could do this clustering job using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm.

This tutorial demonstrates clustering latitude-longitude spatial data with DBSCAN/haversine and avoids all those Euclidean-distance problems:

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed. Also notice that the eps value is in radians and that .fit() takes the coordinates in radian units for the haversine metric.

`.as_matrix()` is deprecated >0.23.0. Use df.values instead. — There, May 18 '22 at 07:36

score 7 · Answer 2 · edited May 23 '17 at 11:47

It highly depends on your application:

Around the equator the results should be fairly accurate. Close to one of the poles the results won't be useful at all.
It might, however, work as a pre-pocessing step or for applications with low precision requirements, e.g. small, non-overlapping and very distinct clusters.

If you really need the Haversine formula, you might want to look into this discussion. As Anony-Mousse says:

Note that Haversine distance is not appropriate for k-means or average-linkage clustering, unless you find a smart way of computing the mean that minimizes variance. Do not use the arithmetic average if you have the -180/+180 wrap-around of latitude-longitude coordinates.

Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python

2 Answers2