Consider the following simulated data.
x1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,5),rnorm(500000,15))
y1 <- c(rnorm(500000,5),rnorm(500000),rnorm(500000,15),rnorm(500000,5))
label <- rep(c("c1","c2","c3","c4"), each = 500000)
dset <- data.frame(x1, y1, label)
with(dset, plot(x1, y1, col = factor(label)))  # factor() so labels map to colors, not color names
So there are 4 clusters and I want to use the K-means algorithm. It is generally said that an 'nstart' of 20-25 is appropriate. But how does that scale to big samples? My sample size here is 2 million. Is there a way to decide 'nstart' for a big sample?
Here is the code I used. Note that I want to apply some parallel processing so that I can use 4 cores to get the work done.
parLapply(cl, c(25, 25, 25, 25), function(i) kmeans(x = dset[, c(1, 2)], centers = 4, nstart = i))
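For context, one common pattern (a sketch, not necessarily the only approach) is to split the total number of random restarts across workers with the `parallel` package, then keep the fit with the smallest `tot.withinss`, since that is the criterion `kmeans` itself uses to pick among restarts. The cluster setup below and the reduced sample size `n` are illustrative assumptions, not part of the original question.

```r
library(parallel)

# Simulated data, same structure as above but smaller so the example runs quickly
n <- 5000
x1 <- c(rnorm(n, 5), rnorm(n), rnorm(n, 5), rnorm(n, 15))
y1 <- c(rnorm(n, 5), rnorm(n), rnorm(n, 15), rnorm(n, 5))
dset <- data.frame(x1, y1)

cl <- makeCluster(4)
clusterSetRNGStream(cl, 123)   # reproducible parallel RNG streams
clusterExport(cl, "dset")

# Each of the 4 workers performs 25 restarts (100 restarts total)
results <- parLapply(cl, rep(25, 4), function(ns) {
  kmeans(dset[, c("x1", "y1")], centers = 4, nstart = ns)
})
stopCluster(cl)

# Keep the solution with the lowest total within-cluster sum of squares
best <- results[[which.min(vapply(results, function(f) f$tot.withinss, numeric(1)))]]
best$size  # cluster sizes of the chosen fit
```

The key fix relative to the snippet above is passing a function to `parLapply` rather than the already-evaluated result of `kmeans(...)`, and exporting `dset` to the workers before the call.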