I'm trying to reproduce this equation in R to do kernel k-means clustering: (equation image not available)

But the loop I wrote takes too long to finish, and I don't know how to improve it. Here is the part of the code that is causing the problem:

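c = 3
third = numeric(c)  # one result per cluster
for (g in 1:c) {
  ans = 0
  for (k in 1:nrow(iris)) {
    for (l in 1:nrow(iris)) {
      # add kernelmatrix[k, l] only when rows k and l are both in cluster g
      ans = ans + (iris[k, 'cluster'] == g) * (iris[l, 'cluster'] == g) * kernelmatrix[k, l]
    }
  }
  third[g] = ans
}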

This is pseudocode, because it is only part of the full function. The expression (iris[l,'cluster']==g) checks whether the element iris[l,'cluster'] belongs to cluster g, and kernelmatrix[k,l] is an element of the n x n matrix of kernel evaluations.

I know that R isn't very good with loops, but I don't know how to improve them.

EDIT: Here's the code with the kernelmatrix part, although I don't think it is important here (wherever you read data, you can think of any dataset, such as iris):

## Euclidean distance
# Remember:
# 1. ||a|| = sqrt(a DOT a)
# 2. d(x,y) = ||x - y|| = sqrt((x-y) DOT (x-y))
# 3. a DOT b = sum(a*b)

d <- function(x, y) {
  aux = x - y
  dis = sqrt(sum(aux * aux))
  return(dis)
}

## Radial basis function kernel
# Remember:
# 1. K(x,x') = exp(-q * ||x-x'||^2), where ||x-x'|| can be defined as the
#    Euclidean distance and 'q' is the gamma parameter
rbf <- function(x, y, q = 0.2) {
  aux <- d(x, y)
  rbfd <- exp(-q * (aux)^2)
  return(rbfd)
}

# Calculating the kernel matrix
kernelmatrix = matrix(0, nrow(data), nrow(data))
for (i in 1:nrow(data)) {
  for (j in 1:nrow(data)) {
    kernelmatrix[i, j] = rbf(data[i, 1:(ncol(data) - 1)], data[j, 1:(ncol(data) - 1)], q)
  }
}
  • As you said, R is bad with for loops. Your code seems "vectorizable", so the `apply()` family should do the trick. The other solution is using `Rcpp`. – VFreguglia May 28 '18 at 20:45
  • Actually, R is good at loops, especially with the recent updates; see this [post](https://stackoverflow.com/questions/42393658/lapply-vs-for-loop-performance-r). The problem I see in the code is that it has 3 nested `for` loops. Can you do `dput(kernelmatrix)`? – patL May 28 '18 at 20:50
  • iris doesn't have a `cluster` column? – Mislav May 28 '18 at 20:58
  • It doesn't, but it would be something equivalent, where each species corresponds to a number – Mateus Maia May 28 '18 at 21:04
  • Have you tried to find a package that already has a function for kernel k-means? For example https://www.rdocumentation.org/packages/kernlab/versions/0.9-26/topics/kkmeans – Mislav May 28 '18 at 21:07
  • I'm trying to implement my own version of the package; kernlab doesn't return the cluster classifications – Mateus Maia May 28 '18 at 21:12
  • Have you tried nested `foreach` for parallelization? – Mislav May 28 '18 at 21:14
  • `foreach` isn't applicable here, because there is an if condition – Mateus Maia May 28 '18 at 21:22
  • `kernlab::kkmeans()` *does* return the cluster assignments, it is in the `sc@.Data` slot: `library(kernlab); data(iris); sc <- kkmeans(as.matrix(iris[,-5]), centers=3); sc@.Data` result: `2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 3 1 3 1 3 1 3 3 3 3 1 3 1 3 3 1 3 1 3 1 1 1 1 1 1 1 3 3 3 3 1 3 1 1 1 3 3 3 1 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1` – knb May 29 '18 at 14:52

3 Answers

Have you tried using something like the kernlab package? Many package authors will have implemented such things in C++, so they will perform much better than a hand-rolled equation, even once you've vectorised this code (which is the essential step if you want it to perform reasonably).
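As a rough sketch of what vectorising could look like for the kernel matrix itself, base R's `dist()` can replace the two nested loops from the question. The choice of the first four iris columns as the data and `q = 0.2` are assumptions made only for illustration:

```r
# Assumption: the four numeric iris columns stand in for `data`, and q = 0.2.
X <- as.matrix(iris[, 1:4])
q <- 0.2

# dist() returns all pairwise Euclidean distances; squaring gives ||x - x'||^2.
D2 <- as.matrix(dist(X))^2
kernelmatrix <- exp(-q * D2)

# Spot check one entry against the scalar definition of the RBF kernel.
i <- 3
j <- 77
direct <- exp(-q * sum((X[i, ] - X[j, ])^2))
all.equal(kernelmatrix[i, j], direct)
```

This keeps the whole computation inside compiled code (`dist()` and `exp()`), so no R-level loop runs over the pairs of rows.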

Andrew Hill

The R interpreter is indeed generally slow. It does not seem to matter much whether you use for loops or other loop constructs. So try to minimize the amount of actual R code, and when performance gets problematic, consider rewriting the code in C. Use R only as a "driver".

In your case, there are several obvious issues:

Your computation is supposedly symmetric (if your kernel function is symmetric). If you exploit this, you'll be twice as fast.

The inner loop does not need to run at all if the point is not in the cluster: you are just summing up zeros.

You do the selections n*n times (n being the number of rows). Move them out of the loop, so that you do them only n times. Then vectorize all operations.

And to become much faster, try replacing the entire inner two loops with a matrix operation (which will run in C, not with two R interpreter loops...). Naively, a multiplication. But then realize that you are just doing a selection. So what you want to write is sum(kernelmatrix[selection,selection]), right?
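The points above can be sketched as follows. This is only an illustration under assumptions: the iris species codes stand in for the cluster labels, and the kernel matrix is an RBF kernel with `q = 0.2` built with `dist()`; the name `n_clusters` corresponds to `c` in the question.

```r
# Assumptions: species codes as cluster labels, RBF kernel with q = 0.2.
X <- as.matrix(iris[, 1:4])
cluster <- as.integer(iris$Species)
kernelmatrix <- exp(-0.2 * as.matrix(dist(X))^2)
n_clusters <- 3  # `c` in the question

# Compute each membership vector once, then sum the selected sub-matrix.
third <- sapply(seq_len(n_clusters), function(g) {
  selection <- cluster == g
  sum(kernelmatrix[selection, selection])
})

# Exploiting symmetry: off-diagonal pairs count twice, the diagonal once.
third_sym <- sapply(seq_len(n_clusters), function(g) {
  Kg <- kernelmatrix[cluster == g, cluster == g, drop = FALSE]
  2 * sum(Kg[upper.tri(Kg)]) + sum(diag(Kg))
})

all.equal(third, third_sym)
```

In practice the plain `sum(kernelmatrix[selection, selection])` is usually the simpler choice; the symmetric variant is shown only to make the "twice as fast" argument concrete.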

Has QUIT--Anony-Mousse
  • Exactly, I want to find which observations in my data set belong to a cluster, and use that information to compute the elements from the kernel matrix. Thank you, I will try the modifications that you suggested – Mateus Maia May 31 '18 at 00:28

This could be a start:

data("iris")
iris <- as.data.frame(iris, stringsAsFactors = FALSE)
ans <- 1:nrow(iris)
third <- ans + as.numeric(iris[,'Sepal.Length']==5)*as.numeric(iris[,'Sepal.Length']==4)

But it is hard to say more without the data set and the definition of the kernel matrix.
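One possible way to take the indicator idea further, sketched under assumptions: a `cluster` column is derived from `Species` (as suggested in the question's comments), and the kernel matrix is an RBF kernel with `q = 0.2`. The names `iris2`, `K` and `Z` are introduced here only for illustration.

```r
# Assumptions: cluster labels come from Species; RBF kernel with q = 0.2.
iris2 <- iris
iris2$cluster <- as.integer(iris2$Species)
K <- exp(-0.2 * as.matrix(dist(iris2[, 1:4]))^2)

# One 0/1 column per cluster: Z[k, g] is 1 when row k belongs to cluster g.
Z <- outer(iris2$cluster, 1:3, "==") * 1

# third[g] = sum over k and l of Z[k, g] * Z[l, g] * K[k, l]
third <- colSums(Z * (K %*% Z))
third
```

The matrix product handles all clusters at once, at the cost of multiplying by many zeros; for a handful of clusters, subsetting the kernel matrix per cluster gives the same numbers.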

Mislav