How to decide 'nstart' for k means in r?

Question

Consider about following simulated data.

x1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,5),rnorm(500000,15))
y1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,15),rnorm(500000,5))
label <- rep(c("c1","c2","c3","c4"), each = 500000)

dset = data.frame(x1,y1,label)
with(dset,plot(x1,y1,col = label))

So there are 4 clusters and I want to use K means algorithm. It is generally said that using 20 - 25 'nstart' is appropriate. But how does it affect to big samples? Here my sample size is 2 millions. So, is there a way to decide 'nstart' for a big sample?

here is the code I sued. Note that, I want to use some parallel processing to my code, so that I can use 4 cores to get the work done.

parLapply( cl, c(25,25,25,25), fun=kmeans( x=dset[,c(1,2), centers=4, nstart=i ) )

I get the impression you have done no searching, since I see no kmeans code. You can show me that impression is incorrect by including your search strategy. — IRTFM, Oct 31 '16 at 05:07
@42- I didn't use the k means code because I though it might make it bit complicate since I have implemented some parallel processing also. How ever I pasted the code. In here I have used 'nstart' as 100 by dividing 25 to each core. What I want to know is how can I decide the 'nstart' for a big sample without guessing it randomly? — Hansy Kumaralal, Oct 31 '16 at 06:19
Possibly related: http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work — Karsten W., Oct 31 '16 at 09:26

score 1 · Answer 1 · answered Nov 01 '16 at 16:15

n_start doesn't necessarily depend on the number of samples.

You will have data sets shere a single run will reliably find the best clustering you can get with k-means.

On other data sets, none will be good, because k-means doesn't work on the data at all.

I's rather do the following: run k-means a small number of times. If you get very similar results, use the best you've had once you stop seeing better results. If the results are very different, then k-means didn't work and you can just stop and do something else.

How to decide 'nstart' for k means in r?

1 Answers1