I'm trying to write a function that loops through a list in order to run kmeans clustering on only specific columns of a dataset. I want the output to be a matrix/dataframe of the cluster membership of each observation when kmeans is run on each set of columns.
Here's a mock dataset and the function I came up with (I'm new to R, so sorry if it's shaky):
set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1),
                     c = rnorm(100,0,1), d = rnorm(100,0,1),
                     e = rnorm(100,0,1))
set.seed(123)
my.kmeans <- function(data,k,...) {
  # set up dataframe for clusters
  clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list(...))))
  for(i in list(...)) {
    kmeans <- kmeans(data[,i],centers = k)
    clusters[,i] <- kmeans$cluster
  }
  colnames(clusters) <- list(...)
  clusters
}
My question is: this seems to work when I only ask it to use consecutive columns, but not when one of the column sets skips over some columns. For instance, the first of the following calls works, but the second does not. Any idea how I can fix this?
# works how I want
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,3)))
# doesn't work
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,5)))
Also, I know people recommend using apply functions and staying away from for loops, but I don't know how to do this with an apply function. Any advice on that would be much appreciated as well.
Thanks so much!
Danny
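One possible direction, offered as a sketch rather than a definitive fix: in the loop above, `i` is the column-index vector itself (for example `c(1,2,5)`), so `clusters[,i]` writes into columns 1, 2 and 5 of a data frame that only has three columns. Storing each result by its position in the list avoids that, and `sapply()` does this naturally. The function name `my_kmeans_sets` and the `"1_2_5"`-style column labels are made up for illustration; only `kmeans()`, `sapply()` and `paste()` come from base R.

```r
# Sketch: run kmeans once per column set and collect the cluster labels
# by the position of the set in the argument list, not by column index.
my_kmeans_sets <- function(data, k, ...) {
  col_sets <- list(...)  # each element is a vector of column indices

  # Each call returns one cluster label per row, so sapply() binds the
  # results into an nrow(data) x length(col_sets) matrix.
  clusters <- sapply(col_sets, function(cols) {
    kmeans(data[, cols, drop = FALSE], centers = k)$cluster
  })
  clusters <- as.data.frame(clusters)

  # Hypothetical naming scheme: label each result by the columns used.
  colnames(clusters) <- sapply(col_sets, paste, collapse = "_")
  clusters
}

set.seed(123)
mydata <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100),
                     d = rnorm(100), e = rnorm(100))

res <- my_kmeans_sets(mydata, k = 8, c(1, 2), c(2, 3), c(1, 2, 5))
head(res)
```

The `drop = FALSE` keeps a single-column selection as a data frame, so a set such as `c(4)` should also work. If a for loop is preferred, looping over `seq_along(list(...))` and assigning to `clusters[, j]` would be the equivalent change.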