5

kmeans does not work properly for geospatial coordinates - even when changing the distance function to haversine as stated here.

I had a look at DBSCAN, which doesn't let me set a fixed number of clusters.

  1. Is there any algorithm (in Python if possible) that takes the same inputs as kmeans? or
  2. Can I easily convert latitude, longitude to Euclidean coordinates (x,y,z) as done here and do the calculation on my data?

It does not have to be perfectly accurate, but it would be nice if it were.

kev
  • Can't find k-medoids in the latest release (3.5.0). Google finds me a very old version, 2.2.0. I haven't done spatial calculations before; does it have a different name now? – kev Jul 01 '15 at 08:41
  • It seems that this boils down to the distance metric that you want. I would start with what is the best distance metric for your problem. – Ivan Jul 01 '15 at 14:49
  • A distance measure tends to satisfy the triangle inequality; raw longitude and latitude do not. This is why great circle, haversine, or some other earth model is used to calculate the distance, by first transforming lon and lat into coordinates. – invoketheshell Jul 02 '15 at 14:43

2 Answers

4

Using raw latitude and longitude leads to problems when your geo data spans a large area, especially since the distance between lines of longitude shrinks toward the poles. To account for this, it is good practice to first convert lon and lat to Cartesian coordinates.
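For the (x, y, z) conversion asked about in the question, here is a minimal sketch. It assumes a perfectly spherical Earth with a mean radius of 6371 km; the names `latlon_to_xyz` and `chord_distance` are made up for illustration and do not come from any library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def latlon_to_xyz(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert latitude/longitude in decimal degrees to Cartesian (x, y, z)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return (x, y, z)


def chord_distance(p, q):
    """Straight-line (Euclidean) distance between two Cartesian points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

The Euclidean distance between two converted points is the chord through the Earth, not the great-circle arc. For nearby points the two are almost equal, and the chord stays well behaved across the 180 degree meridian and near the poles.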

If your geo data spans the United States, for example, you could define the origin from which to calculate distances as the center of the contiguous United States. I believe this is located at latitude 39 degrees 50 minutes N and longitude 98 degrees 35 minutes W.

To convert lat/lon to Cartesian coordinates, calculate the haversine distance from every location in your dataset to the defined origin. Again, I suggest latitude 39 degrees 50 minutes N and longitude 98 degrees 35 minutes W.

You can use haversine in python to calculate these distances:

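from haversine import haversine, Unit

# 39 degrees 50 minutes N, 98 degrees 35 minutes W, written as decimal degrees (west is negative)
origin = (39.8333, -98.5833)
paris = (48.8567, 2.3508)
# Great-circle distance in miles (haversine 2.x API; older releases used miles=True)
haversine(origin, paris, unit=Unit.MILES)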

Now you can use k-means on this data to cluster, assuming the haversine model of the earth is adequate for your needs. If you are doing data analysis and not planning on launching a satellite, I think this should be okay.
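As one way to picture k-means on converted coordinates, here is a dependency-free sketch. Note that it clusters unit-sphere (x, y, z) points rather than distances to a single origin, so it is a swapped-in variant and not the exact recipe above. The function names, the sample points, and the fixed seed are all illustrative assumptions; in practice a library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import math
import random


def to_unit_xyz(lat_deg, lon_deg):
    """Latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def to_latlon(x, y, z):
    """Project a Cartesian point back onto the sphere as (lat, lon) in degrees."""
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))


def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_latlon(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on (lat, lon) pairs, run in 3D Cartesian space."""
    xyz = [to_unit_xyz(lat, lon) for lat, lon in points]
    centers = random.Random(seed).sample(xyz, k)  # initial centers: k input points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in xyz]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(xyz, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, [to_latlon(*c) for c in centers]


# Illustrative data: two points in western Europe, two straddling 180 degrees.
points = [(48.8567, 2.3508), (51.5074, -0.1278), (-17.7, 178.0), (-16.5, -179.5)]
labels, centers = kmeans_latlon(points, k=2)
```

The centers are averaged in 3D and only projected back to lat/lon for reporting, which is why the cluster that straddles the antimeridian gets a center near 180 degrees instead of near 0.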

invoketheshell
2

Have you tried kmeans? The issue raised in the linked question seems to be with points that are close to 180 degrees. If your points are all close enough together (like in the same city or country for example) then kmeans might work OK for you.
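To see why points near 180 degrees are the troublesome case, here is a small sketch comparing a plain arithmetic mean of longitudes (what k-means would compute on raw coordinates) with a mean taken on the circle. The helper names are hypothetical.

```python
import math


def naive_mean_lon(lons):
    """Arithmetic mean of longitudes, as k-means computes on raw lat/lon."""
    return sum(lons) / len(lons)


def circular_mean_lon(lons):
    """Mean longitude computed by averaging unit vectors on the circle."""
    x = sum(math.cos(math.radians(lon)) for lon in lons)
    y = sum(math.sin(math.radians(lon)) for lon in lons)
    return math.degrees(math.atan2(y, x))


# Two points only 2 degrees apart, on either side of the antimeridian.
lons = [179.0, -179.0]
naive = naive_mean_lon(lons)        # lands at longitude 0, the opposite side of the globe
circular = circular_mean_lon(lons)  # lands at about 180 degrees in magnitude
```

If every longitude in the data stays well away from 180 degrees, the two means agree closely, which is consistent with the advice that plain kmeans can be fine for a single city or country.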

maxymoo