Should we use k-means++ instead of k-means?

Question

The k-means++ algorithm addresses the following two weaknesses of the original k-means algorithm:

The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k).
The approximation found can be unsatisfactory with respect to the objective function, compared to the optimal clustering.

But are there any drawbacks to k-means++? Should we always use it instead of k-means from now on?

Fred Foo · Accepted Answer · 2014-03-25T09:13:15.577

Nobody claims k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.
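
For reference, the guarantee in Arthur & Vassilvitskii's paper is a bound on expected cost, not on running time. The symbols here are taken from the paper rather than from the question: \phi is the k-means objective (sum of squared distances from each point to its nearest center) after k-means++ seeding, and \phi_{OPT} is the cost of the optimal clustering.

```latex
\mathbb{E}[\phi] \;\le\; 8\,(\ln k + 2)\,\phi_{\mathrm{OPT}}
```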

I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.
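
To make the split concrete, here is a minimal sketch of the seeding step only (step 1 in the paper), in pure Python. The function names `sq_dist` and `kmeanspp_seeds` are made up for illustration, points are assumed to be tuples of floats, and the squared Euclidean distance is assumed; this is not the authors' reference code. The centers it returns would then be handed to Lloyd's algorithm.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k starting centers by D^2 sampling (hypothetical helper)."""
    rng = rng or random.Random()
    # First center: uniformly at random from the data.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Next center: drawn with probability proportional to d2.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        # Update nearest-center distances with the newly added center.
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers
```

The extra cost over picking k random points is one pass over the data per center, which is part of why k-means++ is not guaranteed to be faster overall.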

The drawbacks of k-means++ are thus:

It too can find a suboptimal solution (it's still an approximation).
It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
It's more complicated than Lloyd's algorithm.
It's relatively new, while Lloyd's has proven its worth for over 50 years.
Better algorithms may exist for specific metric spaces.

That said, if your k-means library supports k-means++, then by all means try it out.

Just a nitpick: it's log k competitive with optimal, not with Lloyd's. In fact Lloyd's can be arbitrarily bad w.r.t. optimal, and has no sane approximation guarantee. — Suresh, Jan 18 '11 at 04:04
@Suresh: that's not a nitpick but a thinko on my side. Corrected. — Fred Foo, Jan 18 '11 at 11:32

score 7 · Answer 2 · edited May 23 '17 at 11:59


Not your question, but an easy speedup to any kmeans method for large N:

1) first do k-means on a random sample of say sqrt(N) of the points
2) then run full k-means from those centres.

I've found this 5-10 times faster than kmeans++ for N = 10000, k = 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample approximates the whole, as well as on N, dim, k, ninit, delta ...
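
The two steps above can be sketched roughly as follows, in pure Python. This is an illustration, not the kmeanssample() code mentioned further down: the names `lloyd` and `kmeans_sample` are hypothetical, the sample size is assumed to be max(k, floor(sqrt(N))), starting centres are picked uniformly from the sample, and points are tuples of floats.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centres."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Update step: mean of each group; an empty group keeps its old centre.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def kmeans_sample(points, k, rng=None):
    """Cluster a ~sqrt(N) sample first, then refine on all points."""
    rng = rng or random.Random()
    # Assumed sample size: about sqrt(N), but never fewer than k points.
    n_sample = min(len(points), max(k, math.isqrt(len(points))))
    sample = rng.sample(points, n_sample)
    start = rng.sample(sample, k)
    coarse = lloyd(sample, start)   # 1) k-means on the small sample
    return lloyd(points, coarse)    # 2) full k-means from those centres
```

The saving comes from doing most of the iterations on the small sample, so the full pass usually needs only a few iterations to converge; whether that holds depends on how representative the sample is.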

What are your N (number of data points), dim (number of features), and k ?
The huge range in users' N, dim, k, data noise, metrics ..., not to mention the lack of public benchmarks, makes it tough to compare methods.

Added: Python code for kmeans() and kmeanssample() is here on SO; comments are welcome.

answered Jan 25 '11 at 17:12

denis


The paper, "Refining Initial Points for K-Means Clustering (1998)", by Bradley and Fayyad, describes a similar technique in greater detail: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.5872 – Predictor Feb 03 '11 at 13:48
Thanks Predictor; have you ever used this ? (Good ideas get re-discovered, not-so-good ideas too.) – denis Feb 04 '11 at 10:14
Have you tried running **k-means++ on a random sample** first, then refining? – Has QUIT--Anony-Mousse Sep 03 '12 at 12:31
@Anony-Mousse, sounds reasonable but no I haven't. Correct me, data sets vary so widely that saying "use variant X on data like Y" is impossible ? – denis Sep 04 '12 at 11:36
Well, k-means++ is a more clever way of seeding on pretty much any kind of data than just choosing random objects. So actually there is little reason to not always use k-means++ unless you have a domain specific heuristic for choosing even better seeds. – Has QUIT--Anony-Mousse Sep 04 '12 at 11:40
