26

I have 1 million 5-dimensional points that I need to group into k clusters, with k << 1 million. Within each cluster, no two points should be too far apart (e.g. each cluster could be a bounding sphere with a specified radius). That means there will probably have to be many clusters of size 1.

But! I need the running time to be well below n^2; n log n or so should be fine. The whole reason I'm doing this clustering is to avoid computing the distance matrix of all n points (which takes n^2 time, i.e. many hours); instead I want to compute distances only between clusters.

I tried the pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach:

  1. Slice space into 20 pieces along each dimension (so there are 20^5 gridboxes in total). Store clusters in these gridboxes, according to their centroids.

  2. For each point, retrieve the gridboxes that are within r (the maximum bounding-sphere radius). If there is a cluster near enough, add the point to that cluster; otherwise create a new cluster.

However, this seems to give me more clusters than I want. I have also implemented approaches similar to this twice, and they give very different answers.
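
For concreteness, here is a rough sketch of the kind of greedy pass I mean (illustrative Python only, not my actual implementation; it uses grid cells of side r rather than a fixed 20 slices per axis, so only the 3^5 neighbouring cells need checking, and it is not optimized):

# Greedy one-pass clustering: each point joins the nearest existing cluster
# whose centroid lies within r, otherwise it starts a new cluster.
# Clusters stay filed under the grid cell of their original centroid,
# which (together with centroid drift) is one reason different point
# orderings can give quite different answers.
import numpy as np
from itertools import product

def greedy_grid_cluster(X, r):
    grid = {}                                   # cell index -> list of cluster ids
    centroids, counts = [], []
    labels = np.empty(len(X), dtype=int)
    offsets = list(product((-1, 0, 1), repeat=X.shape[1]))   # 3**dim neighbour cells
    for i, p in enumerate(X):
        key = tuple((p // r).astype(int))
        best, best_d = -1, r
        for off in offsets:
            nkey = tuple(k + o for k, o in zip(key, off))
            for c in grid.get(nkey, ()):
                d = np.linalg.norm(p - centroids[c])
                if d <= best_d:
                    best, best_d = c, d
        if best < 0:                            # no cluster close enough: new cluster
            best = len(centroids)
            centroids.append(p.astype(float))
            counts.append(1)
            grid.setdefault(key, []).append(best)
        else:                                   # running-mean centroid update
            counts[best] += 1
            centroids[best] += (p - centroids[best]) / counts[best]
        labels[i] = best
    return labels, np.array(centroids)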

Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.

John Hawksley
  • You may explore BIRCH http://people.cs.ubc.ca/~rap/teaching/504/2005/slides/Birch.pdf – Dr. belisarius Dec 09 '10 at 23:36
  • Thanks, I will look at that. It seems like it basically uses a multi phase approach, which was also what I was thinking about doing next. E.g. first remove points that are far from all other points, then do some one-pass clustering, then do a rebalancing etc. Where each step is fast. But it's a lot of trouble to implement. – John Hawksley Dec 09 '10 at 23:45
  • Perhaps a dual KD tree algorithm can work here? – Jules Dec 10 '10 at 00:14
  • Perhaps this is useful: http://www.cs.cmu.edu/~avrim/Papers/bbg-clustering.pdf – Thomas Ahle Jan 10 '12 at 12:16
  • Here is a reasonably recent survey paper from IEEE: https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/29297/Survey_Of_Clustering_Algorithms.pdf (Xu & Wunsch, IEEE Trans. Neural Networks vol. 16 no. 3, May 2005) – tripleee Sep 17 '14 at 05:46
  • The fastest method I have been able to find for 3 dimensional pixel data (C++ source) can be found here: http://www.modejong.com/blog/post17_divquant_clustering – MoDJ Apr 25 '16 at 18:35

6 Answers

14

Consider an approximate nearest neighbor (ANN) algorithm or locality-sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define "close" to be as close as you want. And it's fast.

More precisely, LSH can provide a hash function, h, such that, for two points x and y, and distance metric d,

d(x,y) <= R1  =>  P(h(x) = h(y)) >= P1
d(x,y) >= R2  =>  P(h(x) = h(y)) <= P2

where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
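
As a rough illustration (this is not E2LSH itself, just the standard p-stable random-projection construction; the bucket width w and the number of projections k below are placeholders you would tune):

# Gaussian (p-stable) LSH for Euclidean distance: nearby points are likely
# to share a hash key, distant points are not. Each non-empty bucket is a
# set of mutually "close" candidates to post-process into clusters.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.random((1_000_000, 5))            # stand-in for your 5-D points

k, w = 4, 0.1                             # projections per key, bucket width
A = rng.normal(size=(5, k))               # random projection directions
b = rng.uniform(0, w, size=k)             # random offsets

keys = np.floor((X @ A + b) / w).astype(int)   # one k-tuple key per point
buckets = defaultdict(list)
for i, key in enumerate(map(tuple, keys)):
    buckets[key].append(i)

print(len(buckets), "buckets for", len(X), "points")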

Here is information on LSH including the E2LSH manual. ANN is similar in spirit; David Mount has information here, or try FLANN (has Matlab and Python bindings).

Steve Tjoa
  • Hmm, that's quite helpful. So given that I have an approximate nearest neighbors algorithm (I actually think scipy kdtree has this feature, although I don't know if its fast), are you suggesting an algorithm like the following: 1. For each point, approximately compute all nearest neighbors of radius less than r. 2. Sort the points from fewest close neighbors to most. 3. Make a list that will contain the integer cluster number for each point. 4. Iterating through the sorted list (of neighbor sets), assign all points in the set to a new cluster. (sort of a backwards greedy algorithm) – John Hawksley Dec 10 '10 at 09:29
  • Perhaps that may work. I like to imagine each bin in the hash table as becoming "one point". So instead of clustering one million individual points, you instead cluster *bins* which are collections of points, the assumption being that points in the same bin are already close enough. But finding those bins is fast. That's just one idea. – Steve Tjoa Dec 10 '10 at 16:27
  • OK I see your version, using the hash function. I will definitely try that as well, if I can find a python implementation of LSH. – John Hawksley Dec 10 '10 at 17:23
  • Update: that package doesn't really seem to work, by my estimation. Also I don't think it implements a Euclidean hash. You definitely want the original link Steve provided: http://www.mit.edu/~andoni/LSH/. It links to Matlab code here: http://www.vision.caltech.edu/malaa/software/research/image-search/ – John Hawksley Dec 11 '10 at 00:33
  • Steve, have you used FLANN, any comments ? – denis Dec 15 '10 at 10:03
  • Denis: FLANN was the first one I tried because of the Python compatibility. If I recall, the interface was easy to use. Once I understood the concept better, I ended up writing my own which wasn't too hard. Of the many data structures available for solving ANN, FLANN chooses the best one for the input data. I knew what inputs I had, so I wrote one for me. But for flexibility and ease of use, FLANN is good. – Steve Tjoa Dec 15 '10 at 13:37
  • I'm sorry, but in my opinion LSH is not the right solution. LSH was designed for high-dimensional spaces (you don't need a reference for this; every paper about LSH states it). Otherwise, kd-trees have been shown to be much better. In fact LSH's complexity is pseudo-polynomial, while a kd-tree query is O(log n). Here, with 5-dimensional points, a kd-tree should be much better. – justHelloWorld Jun 23 '16 at 04:01
6

You might like to try my research project called K-tree. It scales well to large inputs compared with k-means, and it forms a hierarchy of clusters. The trade-off is that it produces clusters with higher distortion. It has an average-case runtime of O(n log n) and a worst case of O(n^2), which only happens if you have some weird topology. More details of the complexity analysis are in my Masters thesis. I have used it with very high-dimensional text data and had no problems.

Sometimes bad splits can happen in the tree, where all the data goes to one side (cluster). The trunk in SVN deals with this differently from the current release: it randomly splits the data if there is a bad split. The previous method can force the tree to become too deep when there are bad splits.

Chris de Vries
5

Put the data into an index such as an R*-tree (Wikipedia), then you can run many density-based clustering algorithms (such as DBSCAN (Wikipedia) or OPTICS (Wikipedia)) in O(n log n).

Density-based clustering (Wikipedia) seems to be precisely what you want ("not too far apart").
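
For instance, a sketch with scikit-learn's DBSCAN (it uses a kd-tree/ball-tree index internally; the eps and min_samples values below are placeholders you would tune to your radius):

# Density-based clustering over an indexed point set. eps is the
# neighbourhood radius ("not too far apart"); min_samples=1 makes isolated
# points their own clusters instead of noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100_000, 5)            # stand-in for your points
labels = DBSCAN(eps=0.05, min_samples=1, algorithm='kd_tree').fit_predict(X)
print(len(set(labels)), "clusters")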

Has QUIT--Anony-Mousse
  • R/R+/R*-Tree and KD-tree seem to be very similar. If you're implementing the data structure yourself, a KD-Tree might be far easier. OTOH, an R-Tree is good for successively adding data, because it's self-balancing, whereas the KD-Tree might get inefficient if it's not completely recomputed after addition of new data. See [this SO question](http://stackoverflow.com/questions/4326332/could-anyone-tell-me-whats-the-difference-between-kd-tree-and-r-tree). – akraf Oct 08 '16 at 22:54
3

People have the impression that k-means is slow, but the slowness is really only an issue for the batch EM-style algorithm (Lloyd's). Stochastic gradient methods for k-means are orders of magnitude faster than EM (see www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf).

An implementation is here: http://code.google.com/p/sofia-ml/wiki/SofiaKMeans
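
If you want to stay in Python, scikit-learn's MiniBatchKMeans is a closely related mini-batch stochastic optimizer of the k-means objective; a minimal sketch (the cluster count and batch size are illustrative, not tuned):

# Mini-batch k-means: stochastic updates on small batches instead of full
# Lloyd's passes over all n points.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(1_000_000, 5)          # stand-in for the 5-D points
mbk = MiniBatchKMeans(n_clusters=5000, batch_size=10_000, n_init=3)
labels = mbk.fit_predict(X)               # cluster index per point
centroids = mbk.cluster_centers_

Note that, like any k-means variant, this fixes the number of clusters rather than a maximum cluster radius.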

user1149913
3

Below is a little test bench to see how fast scipy.spatial.cKDTree is on your data, and to get a rough idea of how the distances between nearby points scatter.

A nice way to get K clusters for various K is to build a minimum spanning tree of nearest pairs and remove the K-1 longest edges; see Wayne, Greedy Algorithms.
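
A sketch of that with SciPy/scikit-learn (the k-nearest-neighbour graph stands in for "nearest pairs"; it assumes the graph comes out connected, and K and n_neighbors are illustrative):

# K clusters via a minimum spanning tree: build a sparse k-NN graph, take
# its MST, drop the K-1 heaviest edges, read off the connected components.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

X = np.random.rand(100_000, 5)
G = kneighbors_graph(X, n_neighbors=10, mode='distance')   # sparse distance graph
mst = minimum_spanning_tree(G).tocsr()

K = 1000
cut = np.argsort(mst.data)[-(K - 1):]     # the K-1 longest MST edges
mst.data[cut] = 0
mst.eliminate_zeros()
ncomp, labels = connected_components(mst, directed=False)
print(ncomp, "clusters")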

Visualizing the clusters would be fun -- project to 2d with PCA ?

(Just curious, is your K 10, 100, 1000 ?)

Added 17 Dec, real runtimes: 100000 x 5 points in 10 sec, 500000 x 5 points in 60 sec.

#!/usr/bin/env python
# time scipy.spatial.cKDTree build, query

from __future__ import division
import random
import sys
import time
import numpy as np
from scipy.spatial import cKDTree as KDTree
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
    # $scipy/spatial/kdtree.py is slow but clean, 0.9 has cython
__date__ = "2010-12-17 dec denis"

def clumpiness( X, nbin=10 ):
    """ how clumpy is X ? histogramdd av, max """
        # effect on kdtree time ? not much
    N, dim = X.shape
    histo = np.histogramdd( X, nbin )[0] .astype(int)  # 10^dim
    n0 = histo.size - histo.astype(bool).sum()  # uniform: 1/e^lambda
    print "clumpiness: %d of %d^%d data bins are empty  av %.2g  max %d" % (
        n0, nbin, dim, histo.mean(), histo.max())

#...............................................................................
N = 100000
nask = 0  # 0: ask all N
dim = 5
rnormal = .9
    # KDtree params --
nnear = 2  # k=nnear+1, self
leafsize = 10
eps = 1  # approximate nearest, dist <= (1 + eps) * true nearest
seed = 1

exec "\n".join( sys.argv[1:] )  # run this.py N= ...
np.random.seed(seed)
np.set_printoptions( 2, threshold=200, suppress=True )  # .2f
nask = nask or N
print "\nkdtree:  dim=%d  N=%d  nask=%d  nnear=%d  rnormal=%.2g  leafsize=%d  eps=%.2g" % (
    dim, N, nask, nnear, rnormal, leafsize, eps)

if rnormal > 0:  # normal point cloud, .9 => many near 1 1 1 axis
    cov = rnormal * np.ones((dim,dim)) + (1 - rnormal) * np.eye(dim)
    data = np.abs( np.random.multivariate_normal( np.zeros(dim), cov, N )) % 1
        # % 1: wrap to unit cube
else:
    data = np.random.uniform( size=(N,dim) )
clumpiness(data)
ask = data if nask == N  else random.sample( data, nask )
t = time.time()

#...............................................................................
datatree = KDTree( data, leafsize=leafsize )  # build the tree
print "%.1f sec to build KDtree of %d points" % (time.time() - t, N)

t = time.time()
distances, ix = datatree.query( ask, k=nnear+1, eps=eps )
print "%.1f sec to query %d points" % (time.time() - t, nask)

distances = distances[:,1:]  # [:,0] is all 0, point to itself
avdist = distances.mean( axis=0 )
maxdist = distances.max( axis=0 )
print "distances to %d nearest: av" % nnear, avdist, "max", maxdist

# kdtree:  dim=5  N=100000  nask=100000  nnear=2  rnormal=0.9  leafsize=10  eps=1
# clumpiness: 42847 of 10^5 data bins are empty  av 1  max 21
# 0.4 sec to build KDtree of 100000 points
# 10.1 sec to query 100000 points
# distances to 2 nearest: av [ 0.07  0.08] max [ 0.15  0.18]

# kdtree:  dim=5  N=500000  nask=500000  nnear=2  rnormal=0.9  leafsize=10  eps=1
# clumpiness: 2562 of 10^5 data bins are empty  av 5  max 80
# 2.5 sec to build KDtree of 500000 points
# 60.1 sec to query 500000 points
# distances to 2 nearest: av [ 0.05  0.06] max [ 0.13  0.13]
# run: 17 Dec 2010 15:23  mac 10.4.11 ppc 
denis
2

I have a Perl module that does exactly what you want: Algorithm::ClusterPoints.

First, it uses the algorithm you have described in your post to divide the points into multidimensional sectors, and then it uses brute force to find clusters among points in adjacent sectors.

The complexity ranges from O(N) to O(N**2) in very degenerate cases.

update:

@Denis: no, it is much worse:

For d dimensions, the sector (or little hypercube) size s is determined so that its diagonal l is the minimum distance c allowed between two points in different clusters.

l = c
l = sqrt(d * s * s)
s = sqrt(c * c / d) = c / sqrt(d)

Then you have to consider all the sectors that touch the hypersphere with diameter r = 2c + l centered in the pivot sector.

Roughly, we have to consider ceil(r/2/s) rows of sectors in every direction, and that means n = pow(2 * ceil(r/2/s) + 1, d).

For instance, for d=5 and c=1 we get l=1, s=0.447, r=3, ceil(r/2/s)=4 and n=pow(9, 5)=59049.
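
In Python terms (an illustrative helper, not code from the module):

# Reproduces the sector-geometry numbers above for d dimensions and
# minimum inter-cluster distance c.
import math

def sector_bound(d, c):
    l = c                                  # sector diagonal equals c
    s = l / math.sqrt(d)                   # sector side
    r = 2 * c + l                          # hypersphere diameter around the pivot sector
    rows = int(math.ceil((r / 2) / s))     # sector rows to scan in each direction
    return s, rows, (2 * rows + 1) ** d    # crude bound on the sectors to check

print(sector_bound(5, 1.0))                # (0.447..., 4, 59049)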

Actually we have to check fewer neighbor sectors, because the count above includes every sector in a hypercube of 2 * ceil(r/2/s) + 1 sectors per side, while we only need the ones touching the hypersphere inscribed in it.

Considering the symmetric nature of the "are in the same cluster" relation, we can also halve the number of sectors that have to be checked.

Specifically, Algorithm::ClusterPoints for the case where d=5 checks 3903 sectors.

salva
  • Thanks! It sounds like what you're doing is dividing space into a grid, then using something smarter than my greedy algorithm to locally find clusters. Too bad I'm mediocre at programming and don't know perl so I'm only using python. Also, for extra credit I actually desire to run my algorithm for a full set of 1M points, but later be able to quickly see what the best clusters would have been if I'd only used various subsets (say 500K). The hash style approaches discussed in the previous answer may accomplish this as I only need to redo the post processing. – John Hawksley Dec 10 '10 at 17:44
  • so each sector (little hypercube) looks at 3^5 - 1 = 242 neighbours in 5d ? – denis Dec 15 '10 at 10:02
  • wow, going up fast with dim. Real runtimes would be interesting. – denis Dec 17 '10 at 14:29