2

Possible Duplicate:
Clustering Algorithm for Mapping Application

I have an unordered List of locations (containing their co-ordinates). I know to use the Haversine formula to calculate the distance between two points. But solutions for clustering I've looked at say I'd need to order the list first. What is the correct ordering for locations? I want to cluster (i.e. put all locations into a single clusteredLocation object) all locations which are within 1 metre of each other, is this feasible without sorting first?

Community
  • 1
  • 1
edwardmlyte
  • 15,937
  • 23
  • 58
  • 83
  • What is the problem with sorting the data first? – Peter Lawrey Jun 18 '12 at 14:58
  • 1
    How do you sort locations? As there are two vals, long and lat, is it just whichever is closest to a central point? – edwardmlyte Jun 18 '12 at 15:01
  • What is the "clustering behaviour" you expect for a number of points on a straight line around the globe, with a space of half a meter between them? Or if you have a point on every dot of a 0.5m grid all over - let's say - Honolulu. – brimborium Jun 18 '12 at 15:07
  • I have no idea what behaviour I'd expect from such a scenario. I also agree with trashgod's duplication spot. I thought I was asking a different question but looking at it now it's essentially the same. – edwardmlyte Jun 18 '12 at 15:19
  • What I would do is sort on the `x` and find all the groups which are more than 1m apart on the `x`. I would take these groups and sort the on the `y` and find all the groups which are 1m apart on the `y`. Amounst these groups. I would start with a the "first" point and find all the points which are within 1m of that point and any point I add. Then I would take the "next" point and find all the points within 1m of that point. etc. – Peter Lawrey Jun 18 '12 at 15:58
  • Who says you need to order them? Not necessary for k-means, single-link or any other [tag:cluster-analysis] algorithm I know. – Has QUIT--Anony-Mousse Jun 18 '12 at 16:41

1 Answers1

4

Actually none of the algorithms I know requires the points to be ordered. That would somewhat defeat the whole purpose of cluster analysis. But maybe you are more thinking of web2.0 kind of aggregation?

Have a look at k-means, single-link and DBSCAN. All well described on Wikipedia, with Hub article Cluster Analysis. None of these require your points to be ordered.

Note that Haversine distance is not appropriate for k-means or average-linkage clustering, unless you find a smart way of computing the mean that minimizes variance. Do not use the arithmetic average if you have the -180/+180 wrap-around of latitude-longitude coordinates. Single-linkage, complete-linkage, DBSCAN, OPTICS all should be fine.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194