
I am trying to find an efficient way to get row-wise modes on a subset of columns in a data.table.

#Sample data    
a <- data.frame( 
        id=letters[], 
        dattyp1 = sample( 1:2, 26, replace=T) , 
        dattyp2 = sample( 1:2, 26, replace=T) , 
        dattyp3 = sample( 1:2, 26, replace=T) ,
        dattyp4 = sample( 1:2, 26, replace=T) , 
        dattyp5 = sample( 1:2, 26, replace=T) , 
        dattyp6 = sample( 1:2, 26, replace=T)
        )

library(modeest)
library(data.table)

I know from the question "To find 'row wise' 'Mode' of a given data in R" that I can do this:

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

apply(a[ ,paste0("dattyp",1:6)], 1, Mode)

But this is really slow (over my millions of records). I am thinking there must be a way to do it with .SDcols, but the following gives column-wise modes, not row-wise ones:

a <- data.table(a)
a[, lapply(.SD, mfv), .SDcols = paste0("dattyp", 1:6)]
– MatthewR
  • Is the example provided (6 binary integer columns with ~1e7 rows) representative of your actual dataset? The optimal solution will probably vary depending on the column types (i.e. integers can be handled in different ways than character strings), the number of columns, cardinality, and row count. (I took a quick stab at it and couldn't come up with anything faster than your original solution, and fwiw, `modeest::mfv()` seems to be slower than the user-defined function `Mode()`.) – Matt Summersgill May 16 '18 at 21:47
  • I agree, mfv is slower than the mode function the OP defined. – mgriebe May 16 '18 at 22:03
  • https://stackoverflow.com/questions/36916080/r-fast-mode-function-for-use-in-data-table-lapply-sd-mode-by – mgriebe May 16 '18 at 22:06

2 Answers


I think the fastest way is still to convert to a relational (i.e. long) format, aggregate, and then find the max, as in the reldtMtd function below. I wonder if using Rcpp would be faster; a rough sketch of that idea follows the timings.

data:

library(data.table)
M <- 1e6
popn <- 2
set.seed(0L)
a <- data.frame( 
    id=1:M, 
    dattyp1 = sample(popn, M, replace=TRUE), 
    dattyp2 = sample(popn, M, replace=TRUE), 
    dattyp3 = sample(popn, M, replace=TRUE),
    dattyp4 = sample(popn, M, replace=TRUE), 
    dattyp5 = sample(popn, M, replace=TRUE), 
    dattyp6 = sample(popn, M, replace=TRUE)
)    
setDT(a)

methods:

reldtMtd <- function() {
    # reshape to long format, count occurrences of each value per id,
    # then keep the value with the highest count for each id
    melt(a, id.vars = "id")[,
        .N, by = .(id, value)][,
            value[which.max(N)], by = .(id)]
}

#from https://stackoverflow.com/a/8189441/1989480
Mode <- compiler::cmpfun(function(x) {   
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
})
Mode2 <- compiler::cmpfun(function(x) names(which.max(table(x))))
matA <- as.matrix(a[, -1L])

baseMtd1 <- function() apply(matA, 1, Mode)
baseMtd2 <- function() apply(matA, 1, Mode2)
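
As a quick sanity check before benchmarking, the two results can be compared directly (a small sketch, assuming both methods break ties the same way, in favour of the value appearing first in the row, and that reldtMtd returns its result in an unnamed V1 column):

# compare the relational result (ordered by id) against the row-wise apply result;
# if the tie-breaking really matches, this should return TRUE
identical(reldtMtd()[order(id), V1], baseMtd1())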

library(microbenchmark)
microbenchmark(reldtMtd(), baseMtd1(), baseMtd2(), times=3L)

timings:

Unit: seconds
       expr        min         lq       mean     median         uq       max neval
 reldtMtd()   1.882783   1.947515   2.031767   2.012248   2.106259   2.20027     3
 baseMtd1()  15.618716  15.675314  15.809277  15.731913  15.904557  16.07720     3
 baseMtd2() 160.837513 161.692634 162.455048 162.547755 163.263816 163.97988     3
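
As for the Rcpp question, here is a rough, untested sketch of what a row-wise integer mode could look like; rowModeInt and rcppMtd are hypothetical names written for illustration, matA is reused from above, and the tie-breaking (keep the first value, in column order, that attains the highest count) is meant to mirror Mode(). The double loop over columns is fine here because there are only six of them.

library(Rcpp)

cppFunction('
IntegerVector rowModeInt(IntegerMatrix m) {
    int nr = m.nrow(), nc = m.ncol();
    IntegerVector out(nr);
    for (int i = 0; i < nr; ++i) {
        int best = m(i, 0), bestN = 0;
        for (int j = 0; j < nc; ++j) {
            // count how often the value in column j occurs in this row
            int v = m(i, j), n = 0;
            for (int k = 0; k < nc; ++k) if (m(i, k) == v) ++n;
            // strict ">" keeps the first value that reaches the maximum count
            if (n > bestN) { bestN = n; best = v; }
        }
        out[i] = best;
    }
    return out;
}')

rcppMtd <- function() rowModeInt(matA)

rcppMtd() could then be added to the microbenchmark() call above for a direct comparison.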
– chinsoon12

You can try this, though I am not sure how much faster it will be. Note that I am grabbing the first value returned by mfv.

library(modeest)
library(data.table)

a <- data.frame( 
  id=letters[], 
  dattyp1 = sample( 1:2, 26, replace=T) , 
  dattyp2 = sample( 1:2, 26, replace=T) , 
  dattyp3 = sample( 1:2, 26, replace=T) ,
  dattyp4 = sample( 1:2, 26, replace=T) , 
  dattyp5 = sample( 1:2, 26, replace=T) , 
  dattyp6 = sample( 1:2, 26, replace=T)
)


a <- data.table(a)

a[, Mode := mfv(c(dattyp1, dattyp2, dattyp3, dattyp4, dattyp5, dattyp6))[1], by = id]
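
Since the question specifically asks about .SDcols, the same computation can also be written with .SD so the columns do not have to be typed out by hand (a sketch that should give the same result as the line above):

cols <- paste0("dattyp", 1:6)
# one group per id, i.e. per row; unlist(.SD) collects that row's six values
a[, Mode := mfv(unlist(.SD))[1], by = id, .SDcols = cols]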

data.table could be faster. The apply approach:

library(microbenchmark)

microbenchmark(apply = {
    # with = FALSE selects the columns themselves rather than their names
    apply(a[, paste0("dattyp", 1:6), with = FALSE], 1, Mode)
})
Unit: microseconds
  expr     min      lq     mean  median      uq      max neval
 apply 574.025 591.803 1056.807 624.988 704.396 39236.79   100

The data.table `by` approach:

microbenchmark({
    a[, Mode := mfv(c(dattyp1, dattyp2, dattyp3, dattyp4, dattyp5, dattyp6))[1], by = id]
})
Unit: milliseconds
                                                                                                       expr     min       lq
 {     a[, `:=`(Mode, mfv(c(dattyp1, dattyp2, dattyp3, dattyp4,          dattyp5, dattyp6))[1]), by = id] } 2.44109 2.748053
     mean   median       uq      max neval
 3.049809 2.898769 3.139559 6.398032   100
– mgriebe
  • A benchmark with 26 rows and 6 columns is probably measuring overhead. It cannot seriously be taken as a valid comparison of methods for a use case of more than 1M rows. – Uwe May 16 '18 at 23:06