
I have tried to implement the kernel k-means algorithm in R manually, but my loop takes more than 30 minutes to run. Here's the code:

# Calculating kernel k-means
rbfkmeans<-function(data,c,q=0.02,L=0.7){ 
        #associating random classifications to each observation
        iter=0
        data<-data%>%
                mutate(cluster=sample(1:c,nrow(data),replace=T))

        mini=rep(1,nrow(data))

        ## EUCLIDEAN DISTANCE
        # Remember: 
        #1.|| a || = sqrt(aDOTa), 
        #2. d(x,y) = || x - y || = sqrt((x-y)DOT(x-y))
        #3. aDOTb = sum(a*b)


        d<-function(x,y){
                aux=x-y
                dis=sqrt(sum(aux*aux))
                return(dis)
        }

        ## Radial Basis Function Kernel
        # Remember:
        # 1. K(x,x') = exp(-q*||x-x'||^2), where ||x-x'|| is the
        # Euclidean distance and 'q' is the gamma parameter
        rbf<-function(x,y,q=0.2){
                aux<-d(x,y)
                rbfd<-exp(-q*(aux)^2)
                return(rbfd)
        }
        #
        #calculating the kernel matrix
        kernelmatrix=matrix(0,nrow(data),nrow(data))
        for(i in 1:nrow(data)){
                for(j in 1:nrow(data)){
                        kernelmatrix[i,j]=rbf(data[i,1:(ncol(data)-1)],data[j,1:(ncol(data)-1)],q)
                }
        }
        r=rep(0,nrow(data))
        distance=matrix(0,nrow(data),c)
        while(  (sum(r==data[,'cluster'])!=nrow(data)) && iter <30 ){   
                ans=0

                #Calculating the distances in the kernelized version (RBF example)
                print('running')
                third=rep(0,c) #here 'third' holds the per-cluster center terms,
                #since they do not depend on each observation.
                for(g in 1:c){
                        ans=0
                        for(k in 1:nrow(data)){
                                for(l in 1:nrow(data)){
                                        ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]

                                }
                        }
                        third[g]=ans
                }      
                for (ii in 1:nrow(data)){       #for (ii in 1:nrow(data))
                        for(j in 1:c)  {          #for(j in 1:c)
                                distance[ii,j]= kernelmatrix[ii,ii]-2*sum((data[,'cluster']==j)*kernelmatrix[ii,])/sum(data[,'cluster']==j)+third[j]/(sum(data[,'cluster']==j)^2)
                        }
                }
                r=data[,'cluster']
                #Checking the shortest distance
                for(k in 1:nrow(data)){
                        data[k,'cluster']=match(min(distance[k,]),distance[k,])
                        mini[k]=min(distance[k,])
                }  
                plot(data[1:(ncol(data)-1)], col=data$cluster)
                iter=iter+1  
                print(paste('Iteration number:',iter))
                print(paste('Mean of min. distances:',mean(mini)))

                #print(g==data$'cluster')
        }

        return(data)
}

Does anyone have an idea how I can optimize this? The main problem is the calculation of the third term; I guess it's wasting too much time on the verification of (data[k,'cluster']==g) inside the loops, but I don't have any more ideas to improve that...

Note: data[k,'cluster']==g checks whether the observation belongs to cluster g.

EDIT: The part of the code that is taking a long time to run is this:

    for(g in 1:c){
            ans=0
            for(k in 1:nrow(data)){
                    for(l in 1:nrow(data)){
                            ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]
                    }
            }
            third[g]=ans
    }
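For reference, that triple loop only sums the kernel matrix entries whose row and column both belong to cluster g, so it can be collapsed into a submatrix sum with no interpreted loops. A minimal sketch (the function name is mine; `kernelmatrix` and the cluster labels match the code above):

```r
# Vectorized third term: for each cluster g, sum the kernel matrix
# entries whose row AND column indices both belong to cluster g.
third_vectorized <- function(kernelmatrix, cluster, c) {
  sapply(1:c, function(g) {
    idx <- cluster == g          # logical membership vector
    sum(kernelmatrix[idx, idx])  # submatrix sum, done in compiled code
  })
}
```

This is equivalent because (cluster[k]==g)*(cluster[l]==g) is 1 exactly when both indices are in cluster g, and it replaces an O(c*n^2) interpreted loop with c vectorized submatrix sums.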
  • Hi Mateus, a reproducible example of your code (see [here](https://stackoverflow.com/help/mcve)) will help us answer you. – Thomas Guillerme May 28 '18 at 03:44
  • See also [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung May 28 '18 at 04:43
  • The R interpreter *is* very slow. Rewrite critical code parts in a C library instead. Avoid the use of any kind of interpreter loop. Try to use vectorised operations, because the loops hidden in there will usually run in C or Fortran - much faster than R loops. – Has QUIT--Anony-Mousse Jun 02 '18 at 09:30

1 Answer


Looks like you can optimize your distance and radial functions. Your distance function takes the sqrt of the sum, but your radial function then squares it, negating the sqrt:

    d<-function(x,y){
          aux=x-y
          sum(aux*aux)
    }

    ## Radial Basis Function Kernel
    # Remember:
    # 1. K(x,x') = exp(-q*||x-x'||^2), where ||x-x'|| is the
    # Euclidean distance and 'q' is the gamma parameter
    rbf<-function(x,y,q=0.2){
            aux<-d(x,y)
            rbfd<-exp(-q*(aux))
            return(rbfd)
    }
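Beyond that, the whole kernel matrix can be built without any R-level loops, since `dist()` computes all pairwise Euclidean distances in compiled code. A minimal sketch, assuming `X` is the numeric feature matrix (the data without the cluster column; the function name is mine):

```r
# Build the full RBF kernel matrix in one shot:
# dist(X) gives all pairwise Euclidean distances, and
# exp(-q * d^2) is applied elementwise to the whole matrix.
rbf_kernel_matrix <- function(X, q = 0.02) {
  D2 <- as.matrix(dist(X))^2  # squared Euclidean distances
  exp(-q * D2)
}
```

This replaces the nested i/j loop over `rbf()` calls with a single vectorized expression.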

Also, you should be able to convert your code to use foreach loops and then take advantage of one of the parallelization libraries (such as doParallel).
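The foreach/doParallel version follows the same pattern; as a runnable sketch of the idea using base R's `parallel` package (which ships with R, so no extra installs — the function name and worker count are my assumptions), the per-cluster sums could be farmed out like this:

```r
library(parallel)

# Parallel per-cluster sums: each worker handles one cluster index.
# With foreach/doParallel the body inside %dopar% would be identical.
parallel_third <- function(kernelmatrix, cluster, c, workers = 2) {
  cl <- makeCluster(workers)   # assumed worker count
  on.exit(stopCluster(cl))
  unlist(parLapply(cl, 1:c, function(g, K, memb) {
    idx <- memb == g
    sum(K[idx, idx])
  }, kernelmatrix, cluster))
}
```

Note that for sums this cheap the cluster startup cost may dominate; parallelism pays off mainly when n is large.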

Carlos Santillan