
I am implementing a k-means algorithm in R, but I'm having terrible performance issues. I come from Python, Java, and C++, so I'm not really used to coding the R way, and I wanted to ask for advice on the basic operations I'm performing.

First is my function to compute the distance between two points:

distance <- function(pt1, pt2){
  pt1 <- pt1[0:NUMBER_OF_FEATURES]
  pt2 <- pt2[0:NUMBER_OF_FEATURES]

  pt2 <- t(pt2)
  sum <- 0
  counter <- 1
  for (i in 1:nrow(pt2)){
    sum <- sum + ((pt1[counter] - pt2[counter])^2)
    counter <- counter + 1
  }
  value <- sqrt(sum)
  return(value)
} 

It doesn't look like I can do much better from what I understand, but I know I shouldn't really be using for loops in R.

I also have another function that updates the centroid of each cluster, which I coded like this:

update_centroids <- function(ptlst, centroids){
  centroids <- matrix(, nrow = NUMBER_OF_CLUSTERS, ncol = NUMBER_OF_FEATURES)

  for (i in 1:NUMBER_OF_CLUSTERS){
    temp <- ptlst[which(ptlst$cluster == i),]
    temp <- temp[0:NUMBER_OF_FEATURES]
    print(ncol(temp))
    centroid <- c()
    for (j in 1:ncol(temp)){
      centroid <- c(centroid, mean(as.numeric(unlist(temp[j]))))
    }
    print(centroid)
    centroids[i,] <- centroid
  }
  print(centroids)
}

Again, from what I understand, I shouldn't be writing this part like this, but should instead use a more general, vectorized approach that would run much faster.

Overall, my full algorithm runs in 2.24 seconds on the iris dataset, while my own implementation in Python runs in 0.03 seconds.

So I'm clearly doing something wrong here; something is taking a huge amount of time, but I can't put my finger on it.

Thanks in advance for your answers, Shraneid

EDIT: dput-generated file

  • `pt1 <- pt1[0:...]`? In R indexing begins with 1. – jogo Nov 28 '18 at 11:44
  • R is not like C. You need to inform yourself about the concept of vectorization. For instance, you can simply do `sum((pt1 - pt2)^2)` instead of that loop. This does the whole loop in internal C code, which is much faster. – Roland Nov 28 '18 at 11:46
  • my bad, I always forget and find it weird but in the end it does the same anyway (the indexing) – Shraneid Nov 28 '18 at 11:46
  • The loop is treating `pt1` and `pt2` as if they were simple vectors rather than matrices, but in that case what is the point of `t(pt2)`? – John Coleman Nov 28 '18 at 11:50
  • thanks for the sum! I remember trying to use it before but it didn't work because I probably didn't do it right; works like a charm now! – Shraneid Nov 28 '18 at 11:50
  • what is the structure of `ptlst`? – minem Nov 28 '18 at 11:51
  • @JohnColeman I did it to be able to iterate over it easily, but now the sum function works perfectly. It didn't gain much time, unfortunately – Shraneid Nov 28 '18 at 11:51
  • @minem Essentially just a matrix with a point on each row – Shraneid Nov 28 '18 at 11:52
  • @Shraneid can you give a small example, using for example `dput`? – minem Nov 28 '18 at 11:54
  • @minem I would, however I'm afraid I don't know what dput is – Shraneid Nov 28 '18 at 12:01
  • You should figure out a way to vectorize this code, but it doesn't mean that your C++ background can't help you in R. If you implement kmeans in [rcpp](http://rcpp.org/), you will probably blow your Python version out of the water. – John Coleman Nov 28 '18 at 12:02
  • @JohnColeman I tried multiple times to vectorize, but I never got something that worked like I thought it would; for instance, functions that applied to each column instead of each row, etc. But that is what I'm trying to understand with this post! (And I don't really want to use rcpp, because I think it's not going to help me understand R; if I wanted to code it using C++ I'd do it directly in C++ x) ) – Shraneid Nov 28 '18 at 12:04
  • @Shraneid `dput` is https://stat.ethz.ch/R-manual/R-devel/library/base/html/dput.html. – minem Nov 28 '18 at 12:04
  • Your second function lacks context. It would help if you give a [mcve] (which includes how you are trying to apply it to the iris dataset). See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4996248) for what this would mean in R. – John Coleman Nov 28 '18 at 12:07
  • @minem hey so I got the file with the dput, how should I share it? – Shraneid Nov 28 '18 at 12:12
  • @JohnColeman I'm afraid this would take me too long considering my experience with R; I have no idea how to do this efficiently, but I'm just trying to understand how to do basic things, like the fix that was proposed for the first function, which works like a charm. Ultimately, my second function takes a matrix of points (one point on each row, with the assigned cluster in the far-right column) and the centroids matrix, which has the same format as ptlst but only has 3 rows – Shraneid Nov 28 '18 at 12:16
  • @minem I added a link to download the file that dput generated :) – Shraneid Nov 28 '18 at 12:19
  • @Shraneid you should not have added all of the dput output... I updated my answer. – minem Nov 28 '18 at 12:27

1 Answer

distance <- function(pt1, pt2){
  pt1 <- pt1[1:NUMBER_OF_FEATURES]
  pt2 <- pt2[1:NUMBER_OF_FEATURES]
  x <- sum((pt1 - pt2)^2)
  value <- sqrt(x)
  return(value)
} 
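
As a quick sanity check, here is a small usage sketch of the vectorized version (assuming `NUMBER_OF_FEATURES <- 4`, i.e. the four numeric iris columns; the expected Euclidean distance between the first two iris rows is sqrt(0.2^2 + 0.5^2)):

NUMBER_OF_FEATURES <- 4
p1 <- unlist(iris[1, 1:NUMBER_OF_FEATURES])  # 5.1 3.5 1.4 0.2
p2 <- unlist(iris[2, 1:NUMBER_OF_FEATURES])  # 4.9 3.0 1.4 0.2
distance(p1, p2)
# [1] 0.5385165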

For the second function, you are growing an object inside a loop, which is slow in R.
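
To illustrate the point with a generic sketch (not your k-means code): `c(x, new)` reallocates and copies the whole vector on every iteration, whereas a preallocated vector is filled in place:

# growing: the result vector is copied at every iteration
slow <- c()
for (i in 1:10000) slow <- c(slow, i^2)

# preallocating: the vector is created once and filled in place
fast <- numeric(10000)
for (i in 1:10000) fast[i] <- i^2

identical(slow, fast)
# [1] TRUE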

I guess your data looks like:

NUMBER_OF_CLUSTERS <- 2
NUMBER_OF_FEATURES <- 4 
n <- 100
set.seed(13)
ptlst <- data.frame(cluster = sample.int(NUMBER_OF_CLUSTERS, n, replace = T),
                    replicate(NUMBER_OF_FEATURES, rnorm(n)))
head(ptlst)
#   cluster         X1          X2         X3          X4
# 1       2  0.2731292 -2.84476384  0.6137843  2.10781521
# 2       1  0.7555251  1.71457759  0.4126145  1.57738122
# 3       1 -0.3490184 -1.22881682 -0.4588937  0.06149504
# 4       1 -0.5461908 -0.31407296 -0.6731785 -0.23792899
# 5       2  0.2343620 -0.06991232  0.1930543 -0.17730688
# 6       1 -0.2978282 -0.83760143  1.3829291 -1.17393025

So then we can try:

update_centroids <- function(ptlst){
  t(sapply(1:NUMBER_OF_CLUSTERS, function(i) {
    temp <- ptlst[which(ptlst$cluster == i),]
    colMeans(temp)
  }))
}
update_centroids(ptlst)
#      cluster          X1         X2          X3         X4
# [1,]       1  0.07365732 -0.0725119 -0.08745870 0.03406371
# [2,]       2 -0.24100628 -0.1044056  0.09288702 0.40949754

or using data.table (here `.SD` is the per-group subset of the non-key columns, so `lapply(.SD, mean)` takes the mean of each feature within each cluster):

require(data.table)
x <- as.data.table(ptlst)
x[, lapply(.SD, mean), keyby = cluster]
#    cluster          X1         X2          X3         X4
# 1:       1  0.07365732 -0.0725119 -0.08745870 0.03406371
# 2:       2 -0.24100628 -0.1044056  0.09288702 0.40949754

I suggest that you start by reading some guides on R:

https://r4ds.had.co.nz/introduction.html
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

etc.

There are a lot of useful materials online.

minem
  • how could I go about changing the second function? I tried multiple things but they didn't work – Shraneid Nov 28 '18 at 12:00
  • However I still have a huge performance issue and have no idea where it comes from :/ (Running in a total of 2.20 seconds) – Shraneid Nov 28 '18 at 12:37
  • @Shraneid 2.2 seconds is not slow. Why is it a problem for you? Do you have 'large' data to test this on? You can look at http://adv-r.had.co.nz/Profiling.html (see the timing sketch after these comments) – minem Nov 28 '18 at 12:45
  • I used the iris dataset, which has 150 samples, so 2.2 seconds is very slow; you can easily compare to my Python implementation, which runs in 0.03 seconds. I expected the R version to be faster – Shraneid Nov 28 '18 at 17:08
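
For the timing question above, a minimal sketch of how one could measure and profile the run (here `run_kmeans()` is a hypothetical wrapper around the full algorithm, which is not shown in the question; substitute your own top-level call):

# time the whole run once
system.time(result <- run_kmeans(iris))

# profile to see which functions the time is actually spent in
Rprof("kmeans_profile.out")
result <- run_kmeans(iris)
Rprof(NULL)
summaryRprof("kmeans_profile.out")$by.self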