
Has anyone tried applying a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results, or allow fewer k-means trials and hence a much greater speed increase? Which smoothing algorithm/method did you use?

The "L-Method" is detailed in: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan

This calculates the evaluation metric for a range of trial cluster counts. Then, to find the knee (which occurs at the optimum number of clusters), two straight lines are fitted using linear regression. A simple iterative process improves the knee fit; this reuses the existing evaluation-metric calculations and does not require any re-runs of k-means.
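
For concreteness, here is a minimal sketch of that knee-finding step (Python; all names are illustrative, and the RMSE weighting follows the spirit of Salvador & Chan rather than their exact formulation):

```python
# Sketch of an L-method knee finder: for each candidate split point,
# fit one regression line to the left part of the curve and one to the
# right, and pick the split with the lowest length-weighted total RMSE.
import numpy as np

def l_method_knee(ks, metric):
    """ks: trial cluster counts; metric: evaluation metric per count."""
    n = len(ks)
    best_k, best_err = None, np.inf
    # Each candidate knee needs at least two points on each side.
    for c in range(2, n - 1):
        left_x, left_y = ks[:c], metric[:c]
        right_x, right_y = ks[c:], metric[c:]
        # Fit a straight line to each side by least squares.
        lf = np.polyfit(left_x, left_y, 1)
        rf = np.polyfit(right_x, right_y, 1)
        rmse_l = np.sqrt(np.mean((np.polyval(lf, left_x) - left_y) ** 2))
        rmse_r = np.sqrt(np.mean((np.polyval(rf, right_x) - right_y) ** 2))
        # Weight each side's RMSE by the fraction of points it covers.
        err = (c / n) * rmse_l + ((n - c) / n) * rmse_r
        if err < best_err:
            best_err, best_k = err, ks[c]
    return best_k
```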

For the evaluation metric, I am using the reciprocal of a simplified version of the Dunn Index, simplified for speed (basically, my diameter and inter-cluster distance calculations are simplified). The reciprocal is so that the index works in the correct direction (i.e. lower is generally better).
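
As a rough sketch only (not my exact simplifications; the diameter and separation approximations below are just one plausible choice):

```python
# One possible "reciprocal simplified Dunn" metric: approximate each
# cluster's diameter from member-to-centroid distances, and inter-cluster
# separation from centroid-to-centroid distances. Lower is better.
import numpy as np

def reciprocal_dunn(points, labels, centroids):
    k = len(centroids)
    # Approximate each cluster's diameter as twice the mean
    # member-to-centroid distance.
    diameters = [
        2.0 * np.mean(np.linalg.norm(points[labels == i] - centroids[i], axis=1))
        for i in range(k)
    ]
    # Approximate inter-cluster distance by centroid separation.
    min_sep = min(
        np.linalg.norm(centroids[i] - centroids[j])
        for i in range(k) for j in range(i + 1, k)
    )
    # Dunn = min separation / max diameter; take the reciprocal.
    return max(diameters) / min_sep
```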

K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
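
The outer loop looks something like this sketch (using scikit-learn's KMeans purely for illustration; my own implementation differs):

```python
# Run k-means several times for each k and keep the run with the lowest
# inertia (within-cluster sum of squares). The n_runs * N fits are what
# make the overall processing time a concern.
from sklearn.cluster import KMeans

def best_of_runs(points, k, n_runs=10):
    best = None
    for seed in range(n_runs):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(points)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Evaluate every trial cluster count:
# results = {k: best_of_runs(points, k) for k in range(2, N + 1)}
```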

  • Thinking about this further, I don't think a flat (i.e. running-average) smoother would have any notable effect, because the L-method then fits lines using least squares. However, a shaped smoother such as a Gaussian might behave differently (see the smoothing sketch after these comments). I am going to try implementing a Gaussian of moderate size (a half-width of about 6-10 seems about right to me). It will be a qualitative test. – winwaed Nov 11 '10 at 23:12
  • I do think this will be a good moderate sized research project. If there are any college students looking for a project, I'd be interested in collaborating/mentoring/co-authoring. Such a project should perform quantitative comparisons and be more general than my specific application. I'll add the project-ideas tag to the question. – winwaed Nov 11 '10 at 23:15
  • I have some very rough, unscientific and qualitative results: I tried Gaussian filters of HalfWidthHalfHeight of 5 and 3. In both cases, it increased the estimated number of clusters, but the estimated error dropped (I ran tests of about 8-10 runs with each configuration). This is real world data, and an increase in the estimate is plausible. So I think this provides enough to warrant a mini research project with controlled data and under better conditions. – winwaed Nov 17 '10 at 15:16
  • @winwaed Did you try whether removing outliers helps? I've had some good experiences pre-processing the cloud: removing outliers, determining K, and then inserting the outliers again. Of course I had a heuristic for determining what was an outlier... – Dr. belisarius Nov 26 '10 at 19:05
  • @belisarius So this would be for the actual clustering, rather than the L-method? No, I have not tried that. I did think of trying to remove outliers from the curve for the L-method (e.g. a kind of median filter), but I didn't think this would work well in 1-D. – winwaed Nov 26 '10 at 19:55
  • @winwaed, automatically finding knees is imho something for the Journal of Nonreproducible Results; I'm happy to look at a cheap plot, e.g. MST with long links. But try stats.stackexchange.com ? – denis Nov 30 '10 at 15:09
  • @Denis: Perhaps in more general cases, however it is curve fitting, so if the data is roughly an "L" then it should work. Also the paper referenced reports success with the method. – winwaed Dec 03 '10 at 13:59
  • I'm getting file not found on the link to L-method – Andres Jan 06 '11 at 21:22
  • @Andres: Looks like I had a deep search link. I've just found a different, shorter academic-based URL and updated the link. – winwaed Jan 07 '11 at 01:21
  • @winwaed: Because it has been a long time and there is no accepted solution, I was wondering if you tried out anything new in this area. My specific question would be if removing outliers or applying smoothing actually help your clustering results. – Legend Jul 12 '11 at 19:11
  • What I have been doing is skipping the calculation for some values of N. If we are interested in cluster counts from M to N, then I calculate up to 2N, to give enough of a line on the right-hand side. By dropping some of these high counts (e.g. only computing alternate values beyond a certain point), I get similar accuracy with some decent time savings. A couple of weeks ago I also multithreaded the code, which makes a big difference on a Core i7 :-) – winwaed Jul 12 '11 at 19:19
  • @winwaed: +1 Thank you for getting back so quickly. I was facing an issue posted here: http://stackoverflow.com/questions/6659648/finding-the-elbow-point-of-a-curve-in-a-stable-way In a way, I am not understanding how to get a stable estimate of the curve, let alone estimate the elbow of the curve. The two plots shown there were from two different runs of 20 iterations of k-means. Also, if possible, would you mind elaborating a bit on what criteria you use to discard the high counts? – Legend Jul 12 '11 at 19:58
  • I just calculate for alternate values (e.g. odd values). I only do this at the high end, where it is less critical. – winwaed Jul 13 '11 at 00:23
  • There is also X-means, a k-means variant that starts with k=2 and then iteratively splits clusters further. – Has QUIT--Anony-Mousse Mar 27 '13 at 11:28
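
A minimal sketch of the Gaussian smoothing discussed in the comments above, assuming SciPy is available (the half-width-at-half-height to sigma conversion is the standard Gaussian relation):

```python
# Smooth the 1-D evaluation-metric curve with a Gaussian kernel before
# running the L-method. HWHM = sqrt(2 ln 2) * sigma for a Gaussian.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_metric(metric, hwhm=5.0):
    sigma = hwhm / np.sqrt(2.0 * np.log(2.0))
    # 'nearest' padding avoids pulling the curve ends toward zero.
    return gaussian_filter1d(np.asarray(metric, dtype=float), sigma,
                             mode="nearest")
```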

1 Answer


I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee of the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.

The best solution was to find the point with the maximum distance d, as in the figure below:

[Figure: the knee is the point on the curve at the maximum perpendicular distance d from the straight line joining the curve's two endpoints.]
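
A minimal sketch of that computation (NumPy; names are illustrative):

```python
# Geometric knee finder: return the index of the point on the curve
# farthest from the chord joining its first and last points.
import numpy as np

def max_distance_knee(x, y):
    p1 = np.array([x[0], y[0]], dtype=float)
    p2 = np.array([x[-1], y[-1]], dtype=float)
    pts = np.column_stack([x, y]).astype(float)
    chord = p2 - p1
    # Perpendicular distance from each point to the chord, via the
    # z-component of the 2-D cross product.
    d = np.abs(chord[0] * (pts[:, 1] - p1[1])
               - chord[1] * (pts[:, 0] - p1[0])) / np.linalg.norm(chord)
    return int(np.argmax(d))
```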

Note: I haven't read the paper you linked to yet.

Amro
  • Thanks for the reply. That looks to be a more geometric approach than the paper's, but I would not be surprised if it reduces to the same (or very similar) maths. My question was whether it is better to smooth the data first, and for a very specific application (the data points are fit measures for clusters of varying counts). – winwaed Jan 08 '11 at 00:47
  • @Amro: Did you find this technique worked better than the second-derivative test? Is there a standard name for this technique, by any chance? – Legend Jul 12 '11 at 20:02
  • The L Method is what the paper calls it. I think I have too much noise for a second derivative to accurately find the knee. – winwaed Jul 13 '11 at 00:22
  • @Legend: as @winwaed mentioned, second derivatives are very sensitive to noise, that's why I used the geometric approach above... – Amro Jul 13 '11 at 18:13