
I'm trying to write a function that loops through a list in order to run kmeans clustering on only specific columns of a dataset. I want the output to be a matrix/dataframe of the cluster membership of each observation when kmeans is run on each set of columns.

Here's a mock dataset and the function I came up with (I'm new to R, so sorry if it's shaky):

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))

set.seed(123)
my.kmeans <- function(data,k,...) {
    # set up dataframe for clusters
    clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list(...))))
    for(i in list(...)) {
        kmeans <- kmeans(data[,i],centers = k)
        clusters[,i] <- kmeans$cluster
    }
    colnames(clusters) <- list(...)
    clusters
}

My question is: this seems to work when I only ask it to use consecutive columns, but not when I ask it to skip around some. For instance, the first of the following works, but the second does not. Any idea how I can fix this?

# works how I want 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,3)))

# doesn't work 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,5)))

Also, I know people recommend using apply functions and staying away from for loops, but I don't know how to do this with an apply function. Any advice on that would be much appreciated as well.

Thanks so much!

Danny

  • the problem is in this part of the code `clusters[,i] <- kmeans$cluster` because `i` resolves to 5 in your second case – SatZ Jul 10 '18 at 07:14
  • Thanks so much @SatZ! Could you explain why i resolves to 5? And how I might get around this? Sorry--I'm pretty new to R. Thanks a lot! – Danny Jul 10 '18 at 19:04
  • For anyone who's following (though this is pretty specific so I doubt it), I think I figured it out: you have to change "for(i in list(...))" to "for(i in 1:length(list(...)))"; that way, when you subset with i later, it fills in correctly. Thanks @SatZ – Danny Jul 11 '18 at 19:23
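
To illustrate the comments above, here is a minimal sketch of what `clusters[,i] <- ...` does when `i` is a whole vector of column numbers rather than a single loop index. The toy `clusters` data frame and the `labels` vector are stand-ins made up for this example, not the real kmeans output. It also suggests that the "working" call only appears to work: with `i = c(1,2,3)` the same labels are recycled into all three columns, overwriting whatever the earlier iterations stored.

```r
# Toy stand-in for the `clusters` data frame: 4 rows, 3 columns of NA
clusters <- data.frame(matrix(nrow = 4, ncol = 3))
labels <- c(1L, 2L, 2L, 1L)   # stand-in for kmeans(...)$cluster

# i = c(1, 2, 3): every listed column exists, so the label vector is
# recycled into all three columns at once
ok <- clusters
ok[, c(1, 2, 3)] <- labels
print(ok)

# i = c(1, 2, 5): column 5 does not exist in a 3-column data frame, and
# creating it would skip column 4, so the assignment raises an error
res <- tryCatch({
    bad <- clusters
    bad[, c(1, 2, 5)] <- labels
    "no error"
}, error = function(e) conditionMessage(e))
print(res)
```

Looping over positions (1, 2, 3, ...) instead of over the column sets themselves keeps the column being filled separate from the columns being clustered.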

1 Answer


Building on @SatZ's comments,

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))
mylist <- list(c(1,2), c(2,3), c(1,2,5))

set.seed(123)
my.kmeans <- function(data,k,list) {
  # set up dataframe for clusters: one column per set of columns in `list`
  clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list)))
  for(i in seq_along(list)) {
      # cluster on the i-th set of columns and store the result in column i
      fit <- kmeans(data[,list[[i]]],centers = k)
      clusters[,i] <- fit$cluster
  }
  colnames(clusters) <- list
  clusters
}

head(my.kmeans(data = mydata, k = 8, list = mylist))
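
On the apply part of the question, here is a minimal sketch of the same idea using `sapply` instead of a for loop. The names `my.kmeans.apply` and `col_sets` are hypothetical, picked for this example, and the comma-separated column labels are just one possible naming choice.

```r
# Hypothetical apply-style variant; `col_sets` is an assumed name for the
# list of column-index vectors (what the loop version calls `list`)
my.kmeans.apply <- function(data, k, col_sets) {
    # sapply runs kmeans once per column set and binds the membership
    # vectors into a matrix with one column per set
    clusters <- sapply(col_sets, function(cols) {
        kmeans(data[, cols, drop = FALSE], centers = k)$cluster
    })
    clusters <- as.data.frame(clusters)
    # label each column by the column numbers it was clustered on
    colnames(clusters) <- sapply(col_sets, paste, collapse = ",")
    clusters
}

set.seed(123)
mydata <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100),
                     d = rnorm(100), e = rnorm(100))
res <- my.kmeans.apply(mydata, k = 8,
                       col_sets = list(c(1, 2), c(2, 3), c(1, 2, 5)))
head(res)
```

Because each column set is handled inside its own function call, there is no index to mix up between the columns being clustered and the column being filled.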
Danny
  • you could look at this for more details on how to use ellipsis (...) https://stackoverflow.com/questions/5890576/usage-of-three-dots-or-dot-dot-dot-in-functions https://stackoverflow.com/questions/13353847/how-to-expand-an-ellipsis-argument-without-evaluating-it-in-r – SatZ Jul 12 '18 at 04:19
  • definitely. just thought it would be easier to follow length(list) than length(list(...)). Thanks so much for your help! – Danny Jul 12 '18 at 14:46