19

Using the following code to cluster geolocation coordinates results in 3 clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten

    coordinates= np.array([
               [lat, long],
               [lat, long],
                ...
               [lat, long]
               ])
    x, y = kmeans2(whiten(coordinates), 3, iter = 20)  
    plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
    plt.show()

Is it right to use Kmeans for location clustering, as it uses Euclidean distance and not Haversine formula as a distance function?

rok
  • 9,403
  • 17
  • 70
  • 126

2 Answers2

24

k-means is not a good algorithm to use for spatial clustering, for the reasons you meantioned. Instead, you could do this clustering job using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm.

This tutorial demonstrates clustering latitude-longitude spatial data with DBSCAN/haversine and avoids all those Euclidean-distance problems:

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed. Also notice that the eps value is in radians and that .fit() takes the coordinates in radian units for the haversine metric.

eos
  • 1,475
  • 1
  • 14
  • 25
7

It highly depends on your application:

  • Around the equator the results should be fairly accurate. Close to one of the poles the results won't be useful at all.
  • It might, however, work as a pre-pocessing step or for applications with low precision requirements, e.g. small, non-overlapping and very distinct clusters.

If you really need the Haversine formula, you might want to look into this discussion. As Anony-Mousse says:

Note that Haversine distance is not appropriate for k-means or average-linkage clustering, unless you find a smart way of computing the mean that minimizes variance. Do not use the arithmetic average if you have the -180/+180 wrap-around of latitude-longitude coordinates.

Community
  • 1
  • 1
Falko
  • 17,076
  • 13
  • 60
  • 105