2

I am looking for a way to speed up this algorithm.

My situation is as follows. I have a dataset with 25,000 users with 6 habits. My goal is to develop a hierarchical clustering for the 25,000 users. I run this on a server with 16 cores, 128GB RAM. It took me 3 weeks just for 10,000 users using 6 cores non-stop on my server to calculate this distance matrix. As you can imagine this is too long for my research.

For each of the 6 habits I have created a probability mass distribution (PMF). The PMFs may differ in size (columns) per per habbit. Some habits have 10 columns some 256, all depending on the user with most unbahitual behavior.

The first step in my algrithm is to develop a distance matrix. I use the Hellinger distance to calculate the distance, which is contrary to some packages that use e.g. cathersian/Manhattan. I do need the Hellinger distance, see https://en.wikipedia.org/wiki/Hellinger_distance

What I currently tried is to speed up the algorithm by applying a multicore proces, 6 habits each on a seperate core. Two things that may be beneficial for speed up

(1) C implementation - but I have no idea how to do this (I am not a C programmer) Could you help me on this C implementation if this would be helpful?

(2) make a carthesian product by joining on the table by itself and have all rows and thereafte do a rowwise calculation. The point there is that R gives an error by default in e.g. data.table. Any suggestions for this?

Any other thoughts?

Best Regards Jurjen

# example for 1 habit with 100 users and a PMF of 5 columns
Habit1<-data.frame(col1=abs(rnorm(100)),
               col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
               col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))), 
               col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
               col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))

  # give all users a username same as rowname 
  rownames(Habit1)<- c(1:100)

  # actual calculation  
  Result<-calculatedistances(Habit1)



         HellingerDistance <-function(x){
           #takes two equal sized vectors and calculates the hellinger distance between the vectors

           # hellinger distance function
           return(sqrt(sum(((sqrt(x[1,]) - sqrt(x[2,]))^2)))/sqrt(2))

         }


       calculatedistances <- function(x){
         # takes a dataframe of user IID in the first column and a set of N values per user thereafter 

         # first set all NA to 0
         x[is.na(x)] <- 0



         #create matrix of 2 subsets based on rownumber
         # 1 first the diagronal with 
         D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

         # create a dataframe with hellinger distances
         B <<-data.frame(first=rownames(x)[D[1,]],
                        second=rownames(x)[D[2,]],
                        distance=apply(D, 2, function(y) HellingerDistance(x[ y,]))
         )


         # reshape dataframe into a matrix with users on x and y axis
         B<<-reshape(B, direction="wide", idvar="second", timevar="first")

         # convert wide table to distance table object
         d <<- as.dist(B[,-1], diag = FALSE)
         attr(d, "Labels") <- B[, 1]
         return(d)

       }
JR Helmus
  • 311
  • 1
  • 12
  • 1
    I would suggest (1) change your matrix to `long` format, (2) use `data.table` to calculate between pairs of observations, (3) convert results back to a matrix on `wide` format if necessary. [Here is the most efficient way I've found so far to calculate distances between data points using this approach](https://stackoverflow.com/questions/36817423/how-to-efficiently-calculate-distance-between-pair-of-coordinates-using-data-tab) – rafa.pereira Jun 13 '17 at 13:56
  • Thanks for your answer, I do not fully understand your solution neither the example in the link. The link shows a solution for spatial distances and not Hellinger distances. 1. the long format of the data is that as it is in Habit, is that what you meant? 2. how to best implement `data.table` to calculate between pairs of observations? Thanks for yous answer – JR Helmus Jun 13 '17 at 14:32
  • There is a `hellinger` function in R. Have you considered using that? – akash87 Jun 13 '17 at 16:56
  • I considered this function, yet the normal `Hellinger` takes functions of distributions rather dan a discrete distribution itself. Therefore I had to write my own function. But thanks for the suggestion. – JR Helmus Jun 13 '17 at 20:49

2 Answers2

2

I understand this is not a complete answer, but this suggestion is too long for a comment.

Here is how I would go about using data.table to speed up the process. The way it stands, this code still does not achieve what you requested maybe because I'm not entirely sure what you want but hopefully this will give a clear idea of how to proceed from here.

Also, you might wanna take a look at the HellingerDist{distrEx} function to calculate Hellinger Distance.

library(data.table)

# convert Habit1 into a data.table
  setDT(Habit1)

# assign ids instead of working with rownames
  Habit1[, id := 1:100] 

# replace NAs with 0
  for (j in seq_len(ncol(Habit1)))
    set(Habit1, which(is.na(Habit1[[j]])),j,0)

# convert all values to numeric
  for (k in seq_along(Habit1)) set(Habit1, j = k, value = as.numeric(Habit1[[k]]))


# get all possible combinations of id pairs in long format
  D <- cbind(matrix(rep(1:nrow(Habit1),each=2),nrow=2),combn(1:nrow(Habit1), 2))
  D <- as.data.table(D)
  D <- transpose(D)


# add to this dataset the probability mass distribution (PMF) of each id V1 and V2
# this solution dynamically adapts to number of columns in each Habit dataset
  colnumber <- ncol(Habit1) - 1
  cols <- paste0('i.col',1:colnumber) 

  D[Habit1, c(paste0("id1_col",1:colnumber)) := mget(cols ), on=.(V1 = id)]
  D[Habit1, c(paste0("id2_col",1:colnumber)) := mget(cols ), on=.(V2 = id)]


# [STATIC] calculate hellinger distance 
D[, H := sqrt(sum(((sqrt(c(id1_col1,  id1_col2,  id1_col3,  id1_col4,   id1_col5)) - sqrt(c(id2_col1,  id2_col2,  id2_col3,  id2_col4,   id2_col5)))^2)))/sqrt(2) , by = .(V1, V2)]

Now, if you want to make this flexible to the number of columns in each habit data set:

# get names of columns
  part1 <- names(D)[names(D) %like% "id1"]
  part2 <- names(D)[names(D) %like% "id2"]

# calculate distance 
  D[, H2 := sqrt(sum(((sqrt( .SD[, ..part1] ) - sqrt( .SD[, ..part2] ))^2)))/sqrt(2) , by = .(V1,V2) ] 

Now, for a much faster distance calculation

# change 1st colnames to avoid conflict 
  names(D)[1:2] <- c('x', 'y')

# [dynamic] calculate hellinger distance
  D[melt(D, measure = patterns("^id1", "^id2"), value.name = c("v", "f"))[
  , sqrt(sum(((sqrt( v ) - sqrt( f ))^2)))/sqrt(2), by=.(x,y)], H3 := V1,  on = .(x,y)]

# same results
#> identical(D$H, D$H2, D$H3)
#> [1] TRUE
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
  • Thanks for the great answer I will try to implement this tonight. I looked at the `HellingerDist{distrEx}` function but somewhere in the process I decided to use my own function, the thing is I can remember why. – JR Helmus Jun 13 '17 at 18:47
  • I now tried to implement your solution, but indeed it not fully results what I need. I do have some questions on your code. How can I make the `list( i.col1, i.col2, i.col3, i.col4, i.col5 )` dynamic? I need this, since some habits have 256 values whereas other may only have 10. And the algorithm needs to be dynamic. Next, the proposed `H` indeed is not correct and should be dynamic as well. would it be an option to create a matrix from both the `id[n]_col[n]` and pass this to the Hellinger distance function in the other solution? thanks – JR Helmus Jun 15 '17 at 08:28
  • first issue solved `cols<-paste0('i.col',1:5) D[Habit1, c(paste0("id1_col",1:5)) := mget(cols ), on=.(V1 = id)]` – JR Helmus Jun 15 '17 at 10:49
  • Thanks for the incorpotation. My l (not yet optimized version) now rund in 11 minutes for 10,000 users instead of 3 weeks. The `sqrt(sum(((sqrt(c(id1_col1, id1_col2, id1_col3, id1_col4, id1_col5)) - sqrt(c(id2_col1, id2_col2, id2_col3, id2_col4, id2_col5)))^2)))/sqrt(2)` can be made dynamic by the same `mget()` function is not it? – JR Helmus Jun 15 '17 at 13:34
  • I've made a few changes to address your comments. check if the results are what you would get with your original method. – rafa.pereira Jun 15 '17 at 15:42
1

The first thing to optimize code is profiling. By profiling the code you provided, it seems that the main bottleneck is HellingerDistance function.

  • Improve algorithm. In your HellingerDistancefunction, it can be seen when calculating distance of each pair, you recalculate the square-root each time, which is a total waste of time. So here is the improved version, calculatedistances1 is the new function, it first calculate the square-root of x and use new HellingerDistanceSqrt to calculate Hellinger distance, it can be seen the new version speeds up 40%.

  • Improve data structure. I also notice that your x in your original calulatedistance function is a data.frame which overloads too much, so I transform it to a matrix by as.matrix which makes the code faster by more than a magnitude.

Finally, the new calculatedistances1 is more than 70 times faster than the original version on my machine.

# example for 1 habit with 100 users and a PMF of 5 columns
Habit1<-data.frame(col1=abs(rnorm(100)),
                   col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
                   col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
                   col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
                   col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))

# give all users a username same as rowname
rownames(Habit1)<- c(1:100)

HellingerDistance <-function(x){
    #takes two equal sized vectors and calculates the hellinger distance between the vectors

    # hellinger distance function
    return(sqrt(sum(((sqrt(x[1,]) - sqrt(x[2,]))^2)))/sqrt(2))

}

HellingerDistanceSqrt <-function(sqrtx){
    #takes two equal sized vectors and calculates the hellinger distance between the vectors

    # hellinger distance function
    return(sqrt(sum(((sqrtx[1,] - sqrtx[2,])^2)))/sqrt(2))

}

calculatedistances <- function(x){
    # takes a dataframe of user IID in the first column and a set of N values per user thereafter

    # first set all NA to 0
    x[is.na(x)] <- 0



    #create matrix of 2 subsets based on rownumber
    # 1 first the diagronal with
    D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

    # create a dataframe with hellinger distances
    B <<-data.frame(first=rownames(x)[D[1,]],
                    second=rownames(x)[D[2,]],
                    distance=apply(D, 2, function(y) HellingerDistance(x[ y,]))
    )


    # reshape dataframe into a matrix with users on x and y axis
    B<<-reshape(B, direction="wide", idvar="second", timevar="first")

    # convert wide table to distance table object
    d <<- as.dist(B[,-1], diag = FALSE)
    attr(d, "Labels") <- B[, 1]
    return(d)

}


calculatedistances1 <- function(x){
    # takes a dataframe of user IID in the first column and a set of N values per user thereafter

    # first set all NA to 0
    x[is.na(x)] <- 0

    x <- sqrt(as.matrix(x))



    #create matrix of 2 subsets based on rownumber
    # 1 first the diagronal with
    D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

    # create a dataframe with hellinger distances
    B <<-data.frame(first=rownames(x)[D[1,]],
                    second=rownames(x)[D[2,]],
                    distance=apply(D, 2, function(y) HellingerDistanceSqrt(x[ y,]))
    )


    # reshape dataframe into a matrix with users on x and y axis
    B<<-reshape(B, direction="wide", idvar="second", timevar="first")

    # convert wide table to distance table object
    d <<- as.dist(B[,-1], diag = FALSE)
    attr(d, "Labels") <- B[, 1]
    return(d)

}

# actual calculation
system.time(Result<-calculatedistances(Habit1))
system.time(Result1<-calculatedistances1(Habit1))
identical(Result, Result1)
Consistency
  • 2,884
  • 15
  • 23
  • Thanks for this great answer as well. I indeed forgot to profile the function. As soon as the function passed some test results I implemented it and ran it on the whole dataset. The consequence was that I did not want to disturb the calculation process, so I waited until it stopped... with unlucky results. Thanks and I will indeed implement your solution as well. – JR Helmus Jun 13 '17 at 18:51