I am implementing a k-means algorithm in R, but I'm running into terrible performance issues. I come from Python, Java and C++, so I'm not really used to coding the R way, and I'd like some advice on how to write basic operations idiomatically.
First, here is my function to compute the distance between two points:
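distance <- function(pt1, pt2){
  # R drops index 0 silently, so 0:NUMBER_OF_FEATURES selects the same
  # elements as 1:NUMBER_OF_FEATURES
  pt1 <- pt1[0:NUMBER_OF_FEATURES]
  pt2 <- pt2[0:NUMBER_OF_FEATURES]
  # transpose so that each feature of pt2 becomes one row
  pt2 <- t(pt2)
  sum <- 0
  counter <- 1
  # accumulate the squared difference feature by feature
  for (i in 1:nrow(pt2)){
    sum <- sum + ((pt1[counter] - pt2[counter])^2)
    counter <- counter + 1
  }
  value <- sqrt(sum)
  return(value)
}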
As far as I can tell I can't do much better than this, but I know I shouldn't really be using for loops in R.
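For comparison, here is a minimal sketch of the same Euclidean distance without the explicit loop. It assumes each point is either a numeric vector or a single data frame row whose first NUMBER_OF_FEATURES entries are the features. The name distance_vec and the value 4 (the number of numeric columns in iris) are only for illustration.

```r
NUMBER_OF_FEATURES <- 4  # assumed: iris has four numeric feature columns

# Hypothetical vectorized counterpart of distance()
distance_vec <- function(pt1, pt2) {
  # unlist() flattens a one-row data frame into a plain vector;
  # on a vector that is already numeric it changes nothing
  a <- as.numeric(unlist(pt1[1:NUMBER_OF_FEATURES]))
  b <- as.numeric(unlist(pt2[1:NUMBER_OF_FEATURES]))
  # subtraction, squaring and sum() all operate on whole vectors
  sqrt(sum((a - b)^2))
}

# Works on data frame rows (the trailing Species column is ignored) ...
distance_vec(iris[1, ], iris[2, ])
# ... and on plain numeric vectors
distance_vec(c(0, 0, 0, 0), c(3, 4, 0, 0))
```

The arithmetic operators in R already work element-wise on vectors, so the counter and the accumulator are not needed.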
I also have a function that updates the centroid of each cluster, which I wrote like this:
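update_centroids <- function(ptlst, centroids){
  # the centroids argument is overwritten here with an empty (NA) matrix
  centroids <- matrix(, nrow = NUMBER_OF_CLUSTERS, ncol = NUMBER_OF_FEATURES)
  for (i in 1:NUMBER_OF_CLUSTERS){
    # keep only the points assigned to cluster i, then only the feature columns
    temp <- ptlst[which(ptlst$cluster == i),]
    temp <- temp[0:NUMBER_OF_FEATURES]
    print(ncol(temp))  # debug output
    centroid <- c()
    # mean of each feature column, appended one at a time
    for (j in 1:ncol(temp)){
      centroid <- c(centroid, mean(as.numeric(unlist(temp[j]))))
    }
    print(centroid)  # debug output
    centroids[i,] <- centroid
  }
  # print() returns its argument invisibly, so this is also the return value
  print(centroids)
}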
Again, from what I understand, I shouldn't be writing this part with explicit loops; there should be a more general, vectorized way to do it that runs much faster.
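A possible loop-free version of the centroid update could look like the sketch below. It assumes ptlst is a data frame whose first NUMBER_OF_FEATURES columns are numeric and which has an integer cluster column, and that no cluster is empty (an empty cluster would simply be missing from the result). The name update_centroids_vec and the use of Species as a stand-in cluster assignment are only for illustration.

```r
NUMBER_OF_FEATURES <- 4  # assumed values for the iris example
NUMBER_OF_CLUSTERS <- 3

# Hypothetical loop-free counterpart of update_centroids()
update_centroids_vec <- function(ptlst) {
  features <- as.matrix(ptlst[, 1:NUMBER_OF_FEATURES])
  # rowsum() adds up the rows belonging to each cluster, one output row per cluster
  sums <- rowsum(features, ptlst$cluster)
  # dividing by the cluster sizes turns the per-cluster sums into means
  sums / as.vector(table(ptlst$cluster))
}

# Illustration only: use the species as a fake cluster assignment
pts <- iris[, 1:4]
pts$cluster <- as.integer(iris$Species)
centroids <- update_centroids_vec(pts)
centroids
```

Converting the features to a matrix once and leaving out the print() calls avoids repeated data frame subsetting inside the loop, which tends to be slow in R.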
Overall, my full algorithm runs in 2.24 seconds on the iris dataset, while my own implementation in Python runs in 0.03 seconds.
So I'm clearly doing something wrong: something is taking a huge amount of time, but I can't put my finger on it.
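One way to narrow down where the time goes is to time the pieces separately with system.time(). The sketch below compares a row-by-row loop with a vectorized computation of the distance from every point to one centroid. The helper names loop_dist and vec_dist, the repetition count of 200, and the use of the overall column means as a stand-in centroid are assumptions made for the example, not part of my actual code.

```r
X <- as.matrix(iris[, 1:4])   # features only, as a numeric matrix
centroid <- colMeans(X)       # stand-in centroid, for illustration

# Loop version: one distance per row
loop_dist <- function(X, centroid) {
  out <- numeric(nrow(X))
  for (i in seq_len(nrow(X))) {
    out[i] <- sqrt(sum((X[i, ] - centroid)^2))
  }
  out
}

# Vectorized version: sweep() subtracts the centroid from every row at once
vec_dist <- function(X, centroid) {
  sqrt(rowSums(sweep(X, 2, centroid)^2))
}

# Repeat each call so the timings are large enough to compare
system.time(for (k in 1:200) loop_dist(X, centroid))
system.time(for (k in 1:200) vec_dist(X, centroid))
```

The actual numbers depend on the machine, so they are something to measure rather than assume. Wrapping the full algorithm in Rprof() and summaryRprof() would give a per-function breakdown in the same spirit.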
Thanks in advance for your answers, Shraneid
EDIT : dput generated file