Sure, we can write our own initialization routine. For example, we can modify the Forgy method like this:
# modified Forgy
set.seed(1)
c1 <- c(7.8, 4.3, 6.8, 2.4)
cn <- rbind(c1, iris[sample(nrow(iris), 2),-5])
kmeans(iris[,-5], cn)$centers
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 6.684427 2.626896 6.512092 2.09042298
# 2 5.078494 3.646351 1.485264 0.05223007
# 3 6.012102 2.553765 3.869828 1.66717281
The first initial centre is fixed, while the rest are selected randomly from the rows in the data set.
Of course, this makes the nstart argument inapplicable (kmeans only uses it when centers is given as a number), but we can easily replicate that functionality by repeating the above calculation many times and then picking the result with the highest between-cluster sum of squares (BCSS):
# modified Forgy with nstart
set.seed(1)
data(iris)
m <- iris[,-5]
# initializing with the actual centroid of the first species
c1 <- colMeans(m[as.integer(iris[,5]) == 1,])
c1
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 5.006 3.428 1.462 0.246
kf <- function(x, clust, nc) {
  cn <- rbind(clust, x[sample(nrow(x), nc-1),])
  kmeans(x, cn)
}
l <- replicate(100, kf(m, c1, 3), simplify=FALSE)
bss <- sapply(l, '[[', "betweenss")
table(signif(bss, 4))
#
# 538.6 602.5
# 37 63
kmo <- l[[which.max(bss)]]
kmo$centers
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 5.006000 3.428000 1.462000 0.246000
# 2 5.901613 2.748387 4.393548 1.433871
# 3 6.850000 3.073684 5.742105 2.071053
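Picking the run with the highest BCSS should be the same as picking the one with the lowest total within-cluster sum of squares, which is the criterion kmeans itself minimizes, because the total sum of squares is the same in every run. Here is a minimal sketch to check that, reusing the same setup as above; the names fits, wss and tss are mine, and 20 repetitions is an arbitrary choice just to keep it quick.

```r
# Sketch: first centre fixed, the other two sampled from the rows of the data
set.seed(1)
m  <- iris[, -5]
c1 <- colMeans(m[iris$Species == "setosa", ])

fits <- replicate(20, {
  cn <- rbind(c1, m[sample(nrow(m), 2), ])
  kmeans(m, cn)
}, simplify = FALSE)

# pull the sums of squares out of each fitted object
bss <- sapply(fits, `[[`, "betweenss")
wss <- sapply(fits, `[[`, "tot.withinss")
tss <- sapply(fits, `[[`, "totss")

# betweenss + tot.withinss should add up to totss in every run
all.equal(bss + wss, tss)

# so the run with the largest betweenss also attains the smallest tot.withinss
isTRUE(all.equal(wss[which.max(bss)], min(wss)))
```

Both checks should come back TRUE, so selecting on betweenss or on tot.withinss is a matter of taste.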