
I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and determine their separation. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2, where P1 is taken from cluster C1 and P2 is taken from cluster C2. If C1 has X points and C2 has Y points, this requires X*Y distance measurements.
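
For concreteness, here is the brute-force computation as a minimal C# sketch (representing each point as a `double[]` is an assumption of the sketch, not necessarily my actual data structure):

```csharp
using System;
using System.Collections.Generic;

static class ClusterDistanceExact
{
    // Euclidean distance between two points of equal dimensionality.
    public static double EuclideanDistance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Exact single-linkage distance: the minimum over all X*Y pairs (P1 in C1, P2 in C2).
    public static double ExactClusterDistance(IList<double[]> c1, IList<double[]> c2)
    {
        double best = double.MaxValue;
        foreach (var p1 in c1)
            foreach (var p2 in c2)
                best = Math.Min(best, EuclideanDistance(p1, p2));
        return best;
    }
}
```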

I currently estimate this distance in a way that requires X+Y measurements:

  1. Find the centroid Ctr1 of cluster C1.
  2. Find the point P2 in cluster C2 that is closest to Ctr1. (Y comparisons.)
  3. Find the point P1 in C1 that is closest to P2. (X comparisons.)
  4. The distance from P1 to P2 is an approximate measure of the distance between the clusters C1 and C2; it is an upper bound on the true value. (A sketch of this procedure appears below.)
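
Here is a minimal C# sketch of the four steps above. It reuses the `EuclideanDistance` helper from the brute-force sketch and again assumes `double[]` points:

```csharp
using System;
using System.Collections.Generic;

static class ClusterDistanceApproximate
{
    // Approximate cluster separation in X+Y measurements (steps 1-4 above).
    // Returns an upper bound on the true single-linkage distance.
    public static double ApproximateClusterDistance(IList<double[]> c1, IList<double[]> c2)
    {
        // 1. Centroid Ctr1 of cluster C1.
        int dims = c1[0].Length;
        var ctr1 = new double[dims];
        foreach (var p in c1)
            for (int i = 0; i < dims; i++)
                ctr1[i] += p[i];
        for (int i = 0; i < dims; i++)
            ctr1[i] /= c1.Count;

        // 2. P2 = point of C2 closest to the centroid (Y comparisons).
        double[] p2 = c2[0];
        double bestToCentroid = double.MaxValue;
        foreach (var q in c2)
        {
            double d = ClusterDistanceExact.EuclideanDistance(ctr1, q);
            if (d < bestToCentroid) { bestToCentroid = d; p2 = q; }
        }

        // 3. P1 = point of C1 closest to P2 (X comparisons).
        double best = double.MaxValue;
        foreach (var p in c1)
            best = Math.Min(best, ClusterDistanceExact.EuclideanDistance(p, p2));

        // 4. d(P1, P2) approximates (and upper-bounds) the cluster separation.
        return best;
    }
}
```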

If clusters are roughly spherical, this works very well. My test data is composed of ellipsoidal Gaussian clusters, so it performs well in practice. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:

Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?

OR

Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?

(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)

NOTE: I just found a related question that discusses a similar problem, though not necessarily for high dimensions: "Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?"

NOTE: I just discovered that this is called the bichromatic closest-pair problem.

For context, here is an overview of the overall clustering algorithm:

  1. The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.

  2. The second pass performs single-linkage agglomeration by combining clusters if their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based. All clusters whose minimum distance from one another is less than this maximum linkage-distance will be combined. This step has no predefined number of clusters as its goal. (A sketch of this threshold merge appears after this list.)

  3. The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and only combining clusters until the number of clusters equals a predefined target number of clusters. It handles some outliers, preferring to only merge outliers with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.

  4. The fourth pass combines all remaining outliers with the nearest large cluster, but causes no large clusters to merge with other large clusters. (This prevents two adjacent clusters from accidentally being merged due to their outliers forming a thin chain between them.)
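
To make pass 2 concrete, here is a minimal sketch of the threshold merge using a union-find structure. The `UnionFind` class and the `clusterDistance` delegate are illustrative stand-ins, not my actual implementation; any exact or approximate cluster-distance measure can be plugged in:

```csharp
using System;
using System.Collections.Generic;

// Disjoint-set (union-find) over cluster indices, with path compression.
class UnionFind
{
    private readonly int[] parent;
    public UnionFind(int n)
    {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }
    public int Find(int x) => parent[x] == x ? x : parent[x] = Find(parent[x]);
    public void Union(int a, int b) => parent[Find(a)] = Find(b);
}

static class PassTwo
{
    // Merge every pair of clusters whose estimated minimum distance is below
    // the characteristic maximum linkage-distance discovered in pass 1.
    public static UnionFind MergeByThreshold(
        IList<IList<double[]>> clusters,
        Func<IList<double[]>, IList<double[]>, double> clusterDistance,
        double maxLinkageDistance)
    {
        var merged = new UnionFind(clusters.Count);
        for (int i = 0; i < clusters.Count; i++)
            for (int j = i + 1; j < clusters.Count; j++)
                if (clusterDistance(clusters[i], clusters[j]) < maxLinkageDistance)
                    merged.Union(i, j);   // single linkage: one close pair joins the clusters
        return merged;
    }
}
```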

Paul Chernoch
  • did you try something like [that](https://en.wikipedia.org/wiki/Closest_pair_of_points_problem)? – Borbag Jan 06 '16 at 17:35
  • No! It helps to know the "name" of the problem! Thanks! I will read the article. – Paul Chernoch Jan 06 '16 at 17:37
  • Interesting algorithms in the article. However, the recursive divide-and-conquer algorithm's dependence on D (number of dimensions) is a problem, because for me, D is often greater than K (number of clusters). I will study it further. – Paul Chernoch Jan 06 '16 at 17:51
  • Could you try to reduce dimensionality, or are all the dimensions approximately equally important? – kfx Jan 06 '16 at 17:58
  • @kfx - In my real world data, I identified redundant dimensions and reduced it from 40,000 to 950 dimensions. In subsequent processing, I weight zip codes by population, but in the clustering, all dimensions that are not redundant are considered equally important. – Paul Chernoch Jan 06 '16 at 18:10
  • Do you have to use the dimension as they are, or can you perform a Principal Component Analysis on it, and only keep the significant ones ? – Borbag Jan 07 '16 at 17:42
  • I have never performed PCA. It looks interesting but the math is scary. I do perform simpler reductions of my data such as removing redundant dimensions or those that have pairwise 1-1 mappings. – Paul Chernoch Jan 07 '16 at 17:53

2 Answers


You could use an index. That is the classic solution.

A spatial index can help you find the nearest neighbor of any point in roughly O(log n) time. So if your clusters have n and m objects, index the larger cluster and query it with every point of the smaller cluster to find the closest pair in O(n log m) or O(m log n).

A simpler heuristic approach is to iterate your idea multiple times, shrinking your set of candidates. So you find a good pair of objects a, b from the two clusters. Then you discard all objects from each cluster that must (by the triangle inequality) be farther apart (using the upper bound!). Then you repeat this, but without choosing the same a, b again. Once your candidate sets stop improving, do the pairwise comparisons on the remaining objects only. The worst case of this approach remains O(n*m).
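
A rough sketch of one possible realization of this pruning in C# (the reference-point formulation, the helper names, and the single pruning round are illustrative assumptions; in practice you would repeat with fresh (a, b) pairs). Given a pair (a, b) whose distance U upper-bounds the true minimum, the triangle inequality gives d(p, q) >= |d(p, b) - d(q, b)|, so any p in C1 whose distance to b differs from every d(q, b) by more than U can be discarded, and symmetrically for C2 using a:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PruneAndCompare
{
    static double Dist(double[] x, double[] y)
    {
        double s = 0;
        for (int i = 0; i < x.Length; i++) { double d = x[i] - y[i]; s += d * d; }
        return Math.Sqrt(s);
    }

    // Keep only points that could still belong to a pair at distance <= upperBound,
    // judged against the distances of the *other* cluster's points to the reference.
    static List<double[]> Prune(IList<double[]> cluster, double[] reference,
                                double[] otherToReference, double upperBound)
    {
        var sorted = (double[])otherToReference.Clone();
        Array.Sort(sorted);
        var survivors = new List<double[]>();
        foreach (var p in cluster)
        {
            double dp = Dist(p, reference);
            int idx = Array.BinarySearch(sorted, dp);
            if (idx < 0) idx = ~idx;
            bool keep =
                (idx < sorted.Length && Math.Abs(sorted[idx] - dp) <= upperBound) ||
                (idx > 0 && Math.Abs(dp - sorted[idx - 1]) <= upperBound);
            if (keep) survivors.Add(p);
        }
        return survivors;
    }

    // a is a point of c1, b a point of c2, e.g. found with the centroid trick from the question.
    public static double EstimateMinDistance(IList<double[]> c1, IList<double[]> c2,
                                             double[] a, double[] b)
    {
        double upperBound = Dist(a, b);
        double[] c2ToB = c2.Select(q => Dist(q, b)).ToArray();
        double[] c1ToA = c1.Select(p => Dist(p, a)).ToArray();

        var survivors1 = Prune(c1, b, c2ToB, upperBound);
        var survivors2 = Prune(c2, a, c1ToA, upperBound);

        // One pruning round shown; iterate with new pairs until the sets stop shrinking.
        double best = upperBound;
        foreach (var p in survivors1)
            foreach (var q in survivors2)
                best = Math.Min(best, Dist(p, q));
        return best;
    }
}
```

The pruning is sound: the true closest pair always satisfies the triangle-inequality test with any reference, so it survives into the final pairwise comparison.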

Has QUIT--Anony-Mousse
  • My first pass, using the Hilbert curve, is analogous to using a spatial index. That is how I perform the first cut at clustering. However, if the true solution has K clusters, I tend to end up with between 5K and 10K clusters after this step, thus the subsequent passes. To use a proper spatial index (like an R-tree) for more than 20 dimensions would be a bad idea. I am interested in spatial indices designed for high numbers of dimensions, but do not have the skill at the moment. – Paul Chernoch Jan 06 '16 at 17:46
  • At 20+ dimensions *distance* is not reliable anymore. That is why the indexes fail. See the curse of dimensionality; it's all about distances being too similar. The Hilbert curve will also break down at around 5 dimensions often. Because to split every dimension just once, and have non-empty partitions, you need 2^d objects. If you want a nice curve, you want to have at least 2^{4d} objects. Do you have 2^80 objects? – Has QUIT--Anony-Mousse Jan 06 '16 at 17:48
  • Regarding Anony's comment on the "curse of dimensionality", does your data really have variance along all these dimensions? What does a PCA or SVD indicate? – nicholas Jan 06 '16 at 17:59
  • I am all too aware of the curse! However, I have found the Hilbert Curve approach to work well for me for 500 dimensions, but I may not be using it the way others are. (R-Trees do end up with mostly empty partitions as you correctly note, but I do not create partitions.) I assume that my clusters are well-separated. I have plans to write a divisive algorithm for a fifth pass to split overlapping clusters if they look to be multi-modal. – Paul Chernoch Jan 06 '16 at 18:00
  • @nicholas - For my real world data, I remove redundant dimensions (this reduces my problem from 40,000 dimensions to 950 dimensions). Along these 950 dimensions I DO have meaningful variance. (My domain is zip-code to zip-code delivery times using UPS or Fedex.) – Paul Chernoch Jan 06 '16 at 18:04
  • @PaulChernoch R-trees can never end up with empty partitions. You may be thinking of quadtrees? – Has QUIT--Anony-Mousse Jan 06 '16 at 19:32
  • Treating pairwise delivery times as raw data is probably where your error is... what is the *meaning* of Euclidean distance on these rows? Try treating this data as a distance matrix instead, not as data vectors. – Has QUIT--Anony-Mousse Jan 06 '16 at 19:33
  • @Anony-Mousse - you are correct, quadtrees can have empty cells. But you still need to have 2^D points to make many of these tree structures worthwhile. – Paul Chernoch Jan 06 '16 at 19:36
  • Yes... but as noted above, this applies even more so for Hilbert curves. Because Hilbert curves are a linearized quadtree. – Has QUIT--Anony-Mousse Jan 06 '16 at 19:38
  • @Anony-Mousse - the meaning of Euclidean distance is this. If two ZIP codes have the same delivery time in days from every one of our supplier ZIP codes (several thousand suppliers), then their Euclidean distance is zero. To the extent that they have similar delivery times to similar locations, they are separated by a small distance. Just sum the squared time differences per supplier zip code across all supplier dimensions between two destination zip codes and you have the square of the Cartesian (Euclidean) distance. – Paul Chernoch Jan 06 '16 at 19:40
  • Yes. But why don't you just take the actual delivery times as distance? – Has QUIT--Anony-Mousse Jan 06 '16 at 19:44
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/99973/discussion-between-paul-chernoch-and-anony-mousse). – Paul Chernoch Jan 06 '16 at 19:46

I have found a paper that describes a linear time, randomized, epsilon-approximate algorithm for the closest bichromatic point problem:

http://www.cs.umd.edu/~samir/grant/cp.pdf

I will attempt to implement it and see if it works.

UPDATE - After further study, it is apparent that the runtime is proportional to 3^D, where D is the number of dimensions. This is unacceptable. After trying several other approaches, I hit on the following.

  1. Perform a rough clustering into K clusters using an efficient but incomplete method. This method will properly cluster some points, but yield too many clusters. These small clusters remain to be further consolidated to form larger clusters. This method will determine an upper bound distance DMAX between points that are considered to be in the same cluster.
  2. Sort the points in Hilbert curve order.
  3. Throw out all points immediately preceded and succeeded by a neighbor from the same cluster. More often than not, these are interior points of a cluster, not surface points.
  4. For each point P1, search forward, but no farther than the next point from the same cluster.
  5. Compute the distance from the point P1 from cluster C1 to each visited point P2 from cluster C2 and record the distance if it is smaller than any prior distance measured between points in C1 and C2.
  6. However, if P1 has already been compared to a point in C2, do not do so again. Only make a single comparison between P1 and any point in C2.
  7. After all comparisons have been made, there will be at most K(K-1) distances recorded, and many discarded because they are larger than DMAX. These are estimated closest point distances.
  8. Perform merges between clusters if they are nearer than DMAX. (A sketch of steps 2-6 follows this list.)
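
Here is a rough sketch of steps 2 through 6 in C#. The `HilbertPoint` type is an illustrative stand-in for my actual point representation, and the Hilbert sort (step 2) and interior-point filtering (step 3) are assumed to have been done already:

```csharp
using System;
using System.Collections.Generic;

// A point tagged with the rough cluster it was assigned in step 1.
class HilbertPoint
{
    public double[] Coordinates;
    public int ClusterId;
}

static class ClosestPairEstimator
{
    static double Dist(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.Sqrt(s);
    }

    // Returns the estimated closest-point distance for each cluster pair,
    // keyed by (smaller cluster id, larger cluster id).
    public static Dictionary<(int, int), double> EstimatePairDistances(
        IList<HilbertPoint> pointsInHilbertOrder)
    {
        var estimates = new Dictionary<(int, int), double>();
        for (int i = 0; i < pointsInHilbertOrder.Count; i++)
        {
            var p1 = pointsInHilbertOrder[i];
            var comparedClusters = new HashSet<int>();        // step 6: one comparison per (P1, C2)
            for (int j = i + 1; j < pointsInHilbertOrder.Count; j++)
            {
                var p2 = pointsInHilbertOrder[j];
                if (p2.ClusterId == p1.ClusterId)
                    break;                                    // step 4: stop at the next point from P1's cluster
                if (!comparedClusters.Add(p2.ClusterId))
                    continue;                                 // step 6: already compared P1 to this cluster
                double d = Dist(p1.Coordinates, p2.Coordinates);
                var key = p1.ClusterId < p2.ClusterId
                    ? (p1.ClusterId, p2.ClusterId)
                    : (p2.ClusterId, p1.ClusterId);
                if (!estimates.TryGetValue(key, out double prior) || d < prior)
                    estimates[key] = d;                       // step 5: keep the smallest distance seen
            }
        }
        return estimates;
    }
}
```

Step 8 then walks the resulting dictionary and merges any cluster pair whose recorded distance is below DMAX.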

It is hard to conceive of how the Hilbert curve wiggles among the clusters, so my initial estimate was that this approach to finding closest pairs would be proportional to K^2. However, my testing shows that it is closer to K. It may be around K*log(K). Further research is necessary.

As for accuracy:

  • Comparing every point to every other point is 100% accurate.
  • Using the centroid method outlined in my question yields distances that are about 0.1% too high.
  • Using this method yields distances that are at worst 10% too high, and on average 5% too high. However, the true closest cluster almost always turns up among the first three closest clusters, so qualitatively it is good. The final clustering results using this method are excellent. My final clustering algorithm seems to be proportional to D*N*K or D*N*K*log(K).
Paul Chernoch