I am looking for a way to speed up this algorithm.
My situation is as follows. I have a dataset with 25,000 users with 6 habits. My goal is to develop a hierarchical clustering for the 25,000 users. I run this on a server with 16 cores, 128GB RAM. It took me 3 weeks just for 10,000 users using 6 cores non-stop on my server to calculate this distance matrix. As you can imagine this is too long for my research.
For each of the 6 habits I have created a probability mass distribution (PMF). The PMFs may differ in size (columns) per per habbit. Some habits have 10 columns some 256, all depending on the user with most unbahitual behavior.
The first step in my algrithm is to develop a distance matrix. I use the Hellinger distance to calculate the distance, which is contrary to some packages that use e.g. cathersian/Manhattan. I do need the Hellinger distance, see https://en.wikipedia.org/wiki/Hellinger_distance
What I currently tried is to speed up the algorithm by applying a multicore proces, 6 habits each on a seperate core. Two things that may be beneficial for speed up
(1) C implementation - but I have no idea how to do this (I am not a C programmer) Could you help me on this C implementation if this would be helpful?
(2) make a carthesian product by joining on the table by itself and have all rows and thereafte do a rowwise calculation. The point there is that R gives an error by default in e.g. data.table. Any suggestions for this?
Any other thoughts?
Best Regards Jurjen
# example for 1 habit with 100 users and a PMF of 5 columns
Habit1<-data.frame(col1=abs(rnorm(100)),
col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))
# give all users a username same as rowname
rownames(Habit1)<- c(1:100)
# actual calculation
Result<-calculatedistances(Habit1)
HellingerDistance <-function(x){
#takes two equal sized vectors and calculates the hellinger distance between the vectors
# hellinger distance function
return(sqrt(sum(((sqrt(x[1,]) - sqrt(x[2,]))^2)))/sqrt(2))
}
calculatedistances <- function(x){
# takes a dataframe of user IID in the first column and a set of N values per user thereafter
# first set all NA to 0
x[is.na(x)] <- 0
#create matrix of 2 subsets based on rownumber
# 1 first the diagronal with
D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))
# create a dataframe with hellinger distances
B <<-data.frame(first=rownames(x)[D[1,]],
second=rownames(x)[D[2,]],
distance=apply(D, 2, function(y) HellingerDistance(x[ y,]))
)
# reshape dataframe into a matrix with users on x and y axis
B<<-reshape(B, direction="wide", idvar="second", timevar="first")
# convert wide table to distance table object
d <<- as.dist(B[,-1], diag = FALSE)
attr(d, "Labels") <- B[, 1]
return(d)
}