
Consider the following simulated data.

x1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,5),rnorm(500000,15))
y1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,15),rnorm(500000,5))
label <- rep(c("c1","c2","c3","c4"), each = 500000)

dset <- data.frame(x1, y1, label)
with(dset, plot(x1, y1, col = as.factor(label)))  # as.factor: character labels are not valid colours

So there are 4 clusters, and I want to use the k-means algorithm. It is generally said that an 'nstart' of 20 to 25 is appropriate. But how does that carry over to big samples? Here my sample size is 2 million. So, is there a way to decide 'nstart' for a big sample?

Here is the code I used. Note that I want to add some parallel processing to my code, so that I can use 4 cores to get the work done.

clusterExport(cl, "dset")  # make the data available on every worker
parLapply(cl, c(25, 25, 25, 25), fun = function(i) kmeans(x = dset[, c(1, 2)], centers = 4, nstart = i))
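
For reference, a minimal self-contained sketch of the same idea, not the original code: each worker runs `kmeans` with its own share of random starts, and the fit with the smallest total within-cluster sum of squares is kept. The data size (2,000 points per cluster), the 2 workers, the 5 starts per worker, the seed and the names `dat`, `starts`, `fits` and `best` are all assumptions chosen so that the example runs quickly.

```r
library(parallel)

# Smaller stand-in for the simulated data (assumption: 2,000 points per cluster)
set.seed(42)
n <- 2000
dat <- cbind(
  x1 = c(rnorm(n, 5), rnorm(n), rnorm(n, 5), rnorm(n, 15)),
  y1 = c(rnorm(n, 5), rnorm(n), rnorm(n, 15), rnorm(n, 5))
)

cl <- makeCluster(2)           # assumption: 2 workers
clusterSetRNGStream(cl, 123)   # independent, reproducible random streams per worker

# One entry per task; each task performs that many random starts.
starts <- c(5, 5)

# The data are passed as an extra argument, so no export step is needed.
fits <- parLapply(cl, starts, function(i, d) {
  kmeans(d, centers = 4, nstart = i)
}, d = dat)

stopCluster(cl)

# kmeans() already returns the best of its own starts,
# so the overall winner is the best fit across workers.
wss  <- sapply(fits, function(f) f$tot.withinss)
best <- fits[[which.min(wss)]]
```

Splitting the starts this way should be equivalent to a single call with `nstart = sum(starts)`, apart from the random numbers drawn.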
Hansy Kumaralal
  • I get the impression you have done no searching, since I see no kmeans code. You can show me that impression is incorrect by including your search strategy. – IRTFM Oct 31 '16 at 05:07
  • @42- I didn't include the k-means code because I thought it might make things a bit complicated, since I have also implemented some parallel processing. However, I have now pasted the code. Here I have used an 'nstart' of 100 by assigning 25 to each core. What I want to know is: how can I decide 'nstart' for a big sample without guessing it randomly? – Hansy Kumaralal Oct 31 '16 at 06:19
  • Possibly related: http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work – Karsten W. Oct 31 '16 at 09:26

1 Answer


The right nstart doesn't necessarily depend on the number of samples.

You will have data sets where a single run will reliably find the best clustering you can get with k-means.

On other data sets, no run will be good, because k-means doesn't work on the data at all.

I'd rather do the following: run k-means a small number of times. If you get very similar results, keep going until you stop seeing better results, then use the best one you've had. If the results are very different, then k-means didn't work and you can just stop and do something else.
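
As a rough sketch of that procedure (under assumptions, not a definitive recipe): the code below runs `kmeans` a handful of times with a single start each and compares the objective values. The reduced data size, the 5 runs, the seed and the 1% tolerance are arbitrary choices for illustration, and `dat`, `fits`, `wss`, `rel_spread` and `similar` are hypothetical names.

```r
# Smaller stand-in for the question's data (assumption: 5,000 points per cluster)
set.seed(1)
n <- 5000
dat <- cbind(
  x1 = c(rnorm(n, 5), rnorm(n), rnorm(n, 5), rnorm(n, 15)),
  y1 = c(rnorm(n, 5), rnorm(n), rnorm(n, 15), rnorm(n, 5))
)

# A small number of independent runs, one random start each
n_runs <- 5
fits <- lapply(seq_len(n_runs), function(i) kmeans(dat, centers = 4, nstart = 1))

# Objective of each run: total within-cluster sum of squares (lower is better)
wss <- sapply(fits, function(f) f$tot.withinss)
print(wss)

# Relative gap between the worst and the best run
rel_spread <- (max(wss) - min(wss)) / min(wss)

# Arbitrary tolerance (an assumption): treat runs within 1% as "very similar"
similar <- rel_spread < 0.01

# Keep the best run seen so far
best <- fits[[which.min(wss)]]
```

If `similar` is `TRUE`, a few more runs that bring no improvement suggest `best` is good enough. If the values in `wss` differ widely, the runs are landing in different local optima, which is the signal to reconsider k-means for this data rather than to raise `nstart`.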

Has QUIT--Anony-Mousse