4

I'm looking for a python library to organize a set of coordinates into clusters. My input is a list of (latitude, longitude) coordinates, and I want to get a list of clusters that group them according to distance.

I don't know before hand how many clusters I need to obtain, so I can't use something like a K-Means (like the cluster module) algorithm (not alone at least, maybe I there's an algorithm I can use to obtain that number based on the input data).

I also looked at clusterpy, but it seemed overly complicated for the task and the documentation is not very guiding.

Facundo Olano
  • 2,492
  • 2
  • 26
  • 32
  • Maybe the cluster module using Hierarchical clustering is the way to go. I initially discarded it as i thought I could only use exclusive methods, but with the dataset I have that might bit be a problem. – Facundo Olano Oct 31 '12 at 17:58
  • 1
    "I don't know before hand how many clusters I need to obtain, so I can't use something like a K-Means" You should almost certainly use k-means, and use any of the many algorithms for calculating k. See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set and http://stackoverflow.com/questions/1793532/how-do-i-determine-k-when-using-k-means-clustering. – spencer nelson Oct 31 '12 at 18:32
  • 1
    I ended up using the Hierarchical algorithm. Since the coordinates belonged to addresses in different neighborhoods, and I wanted to get a cluster for each neighborhood, a sensible distance limit for the algorithm was enough to get the output I expected. – Facundo Olano Nov 01 '12 at 15:10

4 Answers4

2

I would recommend scikit learn. The linked page has a good discussion about different clustering algorithms. For geographic clustering (as someone above already suggested) DBSCAN works well.

JKucan
  • 21
  • 2
1

You may want to look into algorithms such as DBSCAN (Wikipedia) and OPTICS (Wikipedia). I don't know if there is any good Python implementation around though. The one I've seen mentioned here on SO for OPTICS seemed to be very incorrect and incomplete. DBSCAN is quite simple, you can implement it yourself.

Some key benefits:

  • You can use the great circle distance, which is more appropriate for lat/lng coordinates. K-means will have problems because of the wrap-around at 180° - the mean is not stable
  • You need to set two thresholds: a radius epsilon (for DBSCAN only), which with above distance would be in kilometers, and roughly a minimum cluster size. That parameter should be quite easy to set depending on your use case.
  • If you have a spatial index, it can accelerate the algorithm to O(n log n).
  • You don't need heuristics for the number of clusters!
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
0

I have done exactly the same thing using Python a few years ago for gene sequences, it is completely doable.

To get an optimal number of clusters from the initial data, you need a penalizer while you go through them. There was an excellent section explaining how it can be done in the book Elements of Statistical Learning by Hastie-Tibshirani-Friedman : http://www-stat.stanford.edu/~tibs/ElemStatLearn/

This was where I learned it from, hope it helps!

Vandalay
  • 31
  • 6
0

I have some friends who've used NetWorkX for this type of problem. It's pretty well-written and the documentation is good, too.

dbn
  • 13,144
  • 3
  • 60
  • 86