
I am implementing a k-means algorithm in R, but I'm having terrible performance issues. I come from Python, Java, and C++, so I'm not really used to coding the R way, and I wanted to ask for advice on the basic operations I'm performing.

First is my function to compute the distance between two points:

distance <- function(pt1, pt2){
  pt1 <- pt1[0:NUMBER_OF_FEATURES]
  pt2 <- pt2[0:NUMBER_OF_FEATURES]

  pt2 <- t(pt2)
  sum <- 0
  counter <- 1
  for (i in 1:nrow(pt2)){
    sum <- sum + ((pt1[counter] - pt2[counter])^2)
    counter <- counter + 1
  }
  value <- sqrt(sum)
  return(value)
} 

It doesn't look like I can do much better from what I understand, but I know I shouldn't really be using for loops in R.

I also have another function that updates the centroid of each cluster, which I coded like this:

update_centroids <- function(ptlst, centroids){
  centroids <- matrix(, nrow = NUMBER_OF_CLUSTERS, ncol = NUMBER_OF_FEATURES)

  for (i in 1:NUMBER_OF_CLUSTERS){
    temp <- ptlst[which(ptlst$cluster == i),]
    temp <- temp[0:NUMBER_OF_FEATURES]
    print(ncol(temp))
    centroid <- c()
    for (j in 1:ncol(temp)){
      centroid <- c(centroid, mean(as.numeric(unlist(temp[j]))))
    }
    print(centroid)
    centroids[i,] <- centroid
  }
  print(centroids)
}

Again, from what I understand, I shouldn't be writing this part like this, but should instead use a more general, vectorized approach that would run much faster.

Overall, my full algorithm runs in 2.24 seconds on the iris dataset, while my own implementation in Python runs in 0.03 seconds.

So I'm clearly doing something wrong here; something is taking a huge amount of time, but I can't put my finger on it.

Thanks in advance for your answers, Shraneid

EDIT: dput-generated file

  • `pt1 <- pt1[0:...]`? In R indexing begins with 1. – jogo Nov 28 '18 at 11:44
  • R is not like C. You need to inform yourself about the concept of vectorization. For instance, you can simply do `sum((pt1 - pt2)^2)` instead of that loop. This does the whole loop in internal C code, which is much faster. – Roland Nov 28 '18 at 11:46
  • my bad, I always forget and find it weird but in the end it does the same anyway (the indexing) – Shraneid Nov 28 '18 at 11:46
  • The loop is treating `pt1` and `pt2` as if they were simple vectors rather than matrices, but in that case what is the point of `t(pt2)`? – John Coleman Nov 28 '18 at 11:50
  • thanks for the sum! I remember trying to use it before but it didn't work because I probably didn't do it right; works like a charm now! – Shraneid Nov 28 '18 at 11:50
  • what is the structure of `ptlst`? – minem Nov 28 '18 at 11:51
  • @JohnColeman I did it to be able to iterate over it easily, but now the sum function works perfectly. It didn't gain much time, unfortunately – Shraneid Nov 28 '18 at 11:51
  • @minem Essentially just a matrix with a point on each row – Shraneid Nov 28 '18 at 11:52
  • @Shraneid can you give a small example, using for example `dput`? – minem Nov 28 '18 at 11:54
  • @minem I would, however I'm afraid I don't know what dput is – Shraneid Nov 28 '18 at 12:01
  • You should figure out a way to vectorize this code, but it doesn't mean that your C++ background can't help you in R. If you implement kmeans in [rcpp](http://rcpp.org/), you will probably blow your Python version out of the water. – John Coleman Nov 28 '18 at 12:02
  • @JohnColeman I tried multiple times to vectorize, but I never got something that worked like I thought it would; for instance, functions that applied to each column instead of each row, etc. But that is what I'm trying to understand with this post! (And I don't really want to use rcpp, because I think it's not going to help me understand R; if I wanted to code it using C++ I'd do it directly in C++ x) ) – Shraneid Nov 28 '18 at 12:04
  • @Shraneid `dput` is https://stat.ethz.ch/R-manual/R-devel/library/base/html/dput.html. – minem Nov 28 '18 at 12:04
  • Your second function lacks context. It would help if you give a [mcve] (which includes how you are trying to apply it to the iris dataset). See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4996248) for what this would mean in R. – John Coleman Nov 28 '18 at 12:07
  • @minem hey so I got the file with the dput, how should I share it? – Shraneid Nov 28 '18 at 12:12
  • @JohnColeman I'm afraid this would take me too long considering my experience with R; I have no idea how to do this efficiently, but I'm just trying to understand how to do basic things, like the fix that was proposed for the first function, which works like a charm. Ultimately, my second function takes a matrix of points (one point on each row, with the assigned cluster in the far-right column) and the centroids matrix, which has the same format as ptlst but only has 3 rows – Shraneid Nov 28 '18 at 12:16
  • @minem I added a link to download the file that dput generated :) – Shraneid Nov 28 '18 at 12:19
  • @Shraneid you should not have added all of the dput output... I updated my answer. – minem Nov 28 '18 at 12:27

1 Answer

distance <- function(pt1, pt2){
  pt1 <- pt1[1:NUMBER_OF_FEATURES]
  pt2 <- pt2[1:NUMBER_OF_FEATURES]
  x <- sum((pt1 - pt2)^2)
  value <- sqrt(x)
  return(value)
} 
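
As a quick sanity check, here is a small usage sketch of the vectorized version (assuming `NUMBER_OF_FEATURES <- 4`, i.e. the four numeric iris columns; the expected Euclidean distance between the first two iris rows is sqrt(0.2^2 + 0.5^2)):

NUMBER_OF_FEATURES <- 4
p1 <- unlist(iris[1, 1:NUMBER_OF_FEATURES])  # 5.1 3.5 1.4 0.2
p2 <- unlist(iris[2, 1:NUMBER_OF_FEATURES])  # 4.9 3.0 1.4 0.2
distance(p1, p2)
# [1] 0.5385165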

For the second function, you are growing an object inside a loop, which is slow in R.
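
To illustrate the point with a generic sketch (not your k-means code): `c(x, new)` reallocates and copies the whole vector on every iteration, whereas a preallocated vector is filled in place:

# growing: the result vector is copied at every iteration
slow <- c()
for (i in 1:10000) slow <- c(slow, i^2)

# preallocating: the vector is created once and filled in place
fast <- numeric(10000)
for (i in 1:10000) fast[i] <- i^2

identical(slow, fast)
# [1] TRUE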

I guess your data looks like:

NUMBER_OF_CLUSTERS <- 2
NUMBER_OF_FEATURES <- 4 
n <- 100
set.seed(13)
ptlst <- data.frame(cluster = sample.int(NUMBER_OF_CLUSTERS, n, replace = T),
                    replicate(NUMBER_OF_FEATURES, rnorm(n)))
head(ptlst)
#   cluster         X1          X2         X3          X4
# 1       2  0.2731292 -2.84476384  0.6137843  2.10781521
# 2       1  0.7555251  1.71457759  0.4126145  1.57738122
# 3       1 -0.3490184 -1.22881682 -0.4588937  0.06149504
# 4       1 -0.5461908 -0.31407296 -0.6731785 -0.23792899
# 5       2  0.2343620 -0.06991232  0.1930543 -0.17730688
# 6       1 -0.2978282 -0.83760143  1.3829291 -1.17393025

So then we can try:

update_centroids <- function(ptlst){
  t(sapply(1:NUMBER_OF_CLUSTERS, function(i) {
    temp <- ptlst[which(ptlst$cluster == i),]
    colMeans(temp)
  }))
}
update_centroids(ptlst)
#      cluster          X1         X2          X3         X4
# [1,]       1  0.07365732 -0.0725119 -0.08745870 0.03406371
# [2,]       2 -0.24100628 -0.1044056  0.09288702 0.40949754

or using data.table (here `.SD` is the per-group subset of the non-key columns, so `lapply(.SD, mean)` takes the mean of each feature within each cluster):

require(data.table)
x <- as.data.table(ptlst)
x[, lapply(.SD, mean), keyby = cluster]
#    cluster          X1         X2          X3         X4
# 1:       1  0.07365732 -0.0725119 -0.08745870 0.03406371
# 2:       2 -0.24100628 -0.1044056  0.09288702 0.40949754

I suggest that you start by reading some guides on R:

https://r4ds.had.co.nz/introduction.html
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

etc.

There are a lot of useful materials online.

minem
  • how could I go about changing the second function? I tried multiple things but they didn't work – Shraneid Nov 28 '18 at 12:00
  • However I still have a huge performance issue and have no idea where it comes from :/ (Running in a total of 2.20 seconds) – Shraneid Nov 28 '18 at 12:37
  • @Shraneid 2.2 seconds is not slow. Why is it a problem for you? Do you have 'large' data to test this on? You can look at http://adv-r.had.co.nz/Profiling.html (see the timing sketch after these comments) – minem Nov 28 '18 at 12:45
  • I used the iris dataset, which has 150 samples, so 2.2 seconds is very slow; you can easily compare to my Python implementation, which runs in 0.03 seconds. I expected the R version to be faster – Shraneid Nov 28 '18 at 17:08
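
For the timing question above, a minimal sketch of how one could measure and profile the run (here `run_kmeans()` is a hypothetical wrapper around the full algorithm, which is not shown in the question; substitute your own top-level call):

# time the whole run once
system.time(result <- run_kmeans(iris))

# profile to see which functions the time is actually spent in
Rprof("kmeans_profile.out")
result <- run_kmeans(iris)
Rprof(NULL)
summaryRprof("kmeans_profile.out")$by.self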