8

I'm currently faced with the problem of clustering around 500,000 latitude/longitude pairs in Python. So far I've tried computing a distance matrix with NumPy (to pass into scikit-learn's DBSCAN), but with such a large input it quickly raises a MemoryError.

The points are stored in tuples containing the latitude, longitude, and the data value at that point.

In short, what is the most efficient way to spatially cluster a large number of latitude/longitude pairs in Python? For this application, I'm willing to sacrifice some accuracy in the name of speed.

Edit: The number of clusters for the algorithm to find is unknown ahead of time.

user3681226
  • [Quadtree](https://en.wikipedia.org/wiki/Quadtree)? There is a library [PyQuadTree](https://github.com/karimbahgat/PyQuadTree) that exists if you'd like to try that route. – Cory Kramer Jun 03 '14 at 20:43
  • maybe I'm displaying ignorance, but what is your criterion for # of clusters? (And snide comment, if you set the number of clusters to 1, there's an O(1) algorithm to do it) – Foon Jun 03 '14 at 23:13
  • The k-means algorithm requires you to give it the number of clusters to find as a parameter, as opposed to DBSCAN or OPTICS, which determine the clusters based on other criteria. Also, although O(1) is almost worth it, unfortunately there is usually more than a single cloud system between the north and the south pole haha. – user3681226 Jun 03 '14 at 23:22

2 Answers

6

Older versions of DBSCAN in scikit-learn would compute a complete distance matrix.

Unfortunately, computing a distance matrix needs O(n^2) memory, and that is probably where you run out of memory.

Newer versions of scikit-learn (which version do you use?) should be able to work without a distance matrix, at least when using an index. At 500,000 objects, you do want to use index acceleration, as this reduces runtime from O(n^2) to O(n log n).

I don't know how well scikit-learn supports geodetic distance in its indexes, though. ELKI is the only tool I know of that can use R*-tree indexes to accelerate geodetic distance, which makes it extremely fast for this task (in particular when bulk-loading the index). You should give it a try.

Have a look at the scikit-learn indexing documentation, and try setting algorithm='ball_tree'.
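If your scikit-learn version accepts a haversine metric together with the ball tree, something along these lines is what I have in mind (untested sketch; the eps and min_samples values are placeholders you would have to tune for your data):

import numpy as np
from sklearn.cluster import DBSCAN

# stand-in for your (lat, lon, value) tuples
points = np.column_stack([np.random.uniform(-90, 90, 500000),    # latitude
                          np.random.uniform(-180, 180, 500000),  # longitude
                          np.random.randn(500000)])              # data value

coords = np.radians(points[:, :2])    # the haversine metric expects radians

eps_km = 1.5                          # placeholder neighborhood radius in km
db = DBSCAN(eps=eps_km / 6371.0,      # convert km to radians (Earth radius ~6371 km)
            min_samples=10,           # placeholder, tune for your data
            metric='haversine',
            algorithm='ball_tree').fit(coords)
labels = db.labels_                   # cluster label per point, -1 means noise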

Has QUIT--Anony-Mousse
4

I don't have your data, so I just generated 500k random rows with three columns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

# 500,000 rows of random data standing in for (lat, lon, value)
arr = np.random.randn(500000 * 3).reshape((500000, 3))

# whiten rescales each column to unit variance before clustering
centroids, labels = kmeans2(whiten(arr), 7, iter=20)  # <-- I randomly picked 7 clusters
plt.scatter(arr[:, 0], arr[:, 1], c=labels, alpha=0.33333)

(scatter plot of the 500,000 points colored by cluster assignment)

I timed this, and kmeans2 took 1.96 seconds to run, so I don't think the problem is the size of your data. Put your data in a 500,000 x 3 NumPy array and try kmeans2.
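If your data is currently a list of (latitude, longitude, value) tuples, this is roughly what I mean (a toy sketch; the tuples and the choice of k are just illustrative):

import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# toy stand-in for your list of (latitude, longitude, value) tuples
points = [(40.7, -74.0, 1.3), (34.0, -118.2, 0.7), (41.9, -87.6, 2.1)]

arr = np.asarray(points)                         # shape (n, 3)
coords = whiten(arr[:, :2])                      # cluster on lat/lon only
centroids, labels = kmeans2(coords, 2, iter=20)  # k is still something you have to pick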

o-90
  • Can you think of any way to do this without knowing the number of clusters before running? Unfortunately, that is one of the issues with our data. – user3681226 Jun 03 '14 at 22:48
  • You don't need to know the number of clusters in advance here either. But you do have to specify how many you want. I'd just try different values and then plot the results to see what k is the best fit. I can edit the answer to include a plot if you want me to. – o-90 Jun 03 '14 at 23:23
  • That would be an ideal solution if I didn't have to repeat this process 100,000+ times. As it stands perhaps I can find some way to algorithmically find the optimal k. Thank you! – user3681226 Jun 04 '14 at 01:56
  • 3
    Don't use k-means on latitude, longitude data. The earth is not flat. – Has QUIT--Anony-Mousse Jun 04 '14 at 08:47