4

I have a data.table res which has data as follows:

            V1 V2 V3 V4
  1:     Day_1  4  4  4
  2:     Day_2  1  1  2
  3:     Day_3  4  5  4
  4:     Day_4  3  4  4
  5:     Day_5  3  2  3

I need to select the most frequent value from the columns V2, V3 and V4 combined. That is, I need to select result as follows:

Day_1 4
Day_2 1
Day_3 4
Day_4 4
Day_5 3

I'm not expecting any ties since the there will be always odd number of columns. Is it possible to manipulate the data.table to do this? Or should I modify it some other data type?

Thanks - V

Frank
  • 66,179
  • 8
  • 96
  • 180
visakh
  • 2,503
  • 8
  • 29
  • 55

3 Answers3

5

I'm posting this as a data.table version of this old question until something better is offered

Mode <- function(x) {
  ux <- unique(x)  
  ux[which.max(tabulate(match(x, ux)))]
}

DT[, .(res = Mode(unlist(.SD))), by = V1]

#       V1 res
# 1: Day_1   4
# 2: Day_2   1
# 3: Day_3   4
# 4: Day_4   4
# 5: Day_5   3
Community
  • 1
  • 1
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
3

Here are two variations on David's answer:

# table
DT[,ans:={
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
},.SDcols=-1]

# apply Mode
DT[,ans:=apply(.SD,1,Mode),.SDcols=-1]

I thought I'd take a look at these because splitting DT by row might be slow.

Speed.

Comparison against by-row. Starting out with very few unique values and rows...

n  <- 1e4
nv <- 5
nc <- 3

DT <- do.call(data.table,c(
  list(id=1:n),
  replicate(nc,sample(nv,n,replace=TRUE),simplify=FALSE)
))

require(rbenchmark)

benchmark(
  table     = DT[,.({
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
  }),.SDcols=-1],
  byMode    = DT[,.(Mode(unlist(.SD))), by = id],
  applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
  replications=10
)[1:5]
#        test replications elapsed relative user.self
# 3 applyMode           10    1.66    4.611      1.65
# 2    byMode           10    2.03    5.639      2.02
# 1     table           10    0.36    1.000      0.36

Increasing each of the three parameters separately...

nv <- 1e3 # up from 5

#        test replications elapsed relative user.self
# 3 applyMode           10    1.67    1.000      1.67
# 2    byMode           10    2.05    1.228      2.02
# 1     table           10    4.27    2.557      4.15

n  <- 5e4 # up from 1e4

#        test replications elapsed relative user.self
# 3 applyMode           10    8.67    4.492      8.65
# 2    byMode           10   10.21    5.290     10.22
# 1     table           10    1.93    1.000      1.92

nc <- 100 # up from 3

#        test replications elapsed relative user.self
# 3 applyMode           10    2.59    1.000      2.59
# 2    byMode           10    6.71    2.591      6.69
# 1     table           10   11.69    4.514     11.68

Discussion. (I'm comparing the elapsed columns across the benchmarks.)

  • nv. While table wins for the small case, it scales poorly with the number of unique values, since its tab object becomes very large. The other two methods are unaffected.
  • n. Everything scales linearly with the number of rows. (I expected table to scale worse on this dimension, too, but maybe n needs to be even larger for that.)
  • nc. table scales linearly with columns as well, while the other two do better.

Again, with melt

# back to original values for n, nv, nc
benchmark(
  table     = DT[,.({
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
  }),.SDcols=-1],
  byMode    = DT[,.(Mode(unlist(.SD))), by = id],
  applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
  melt      = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][, 
                value[which.max(N)], by = id],
  melto     = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][
                order(N)][,last(value),by=id],
  meltMode  = melt(DT, id.vars = 'id')[,Mode(value),by=id],
  replications=10
)[1:5]
#        test replications elapsed relative user.self
# 3 applyMode           10    2.42    8.643      2.36
# 2    byMode           10    2.84   10.143      2.81
# 4      melt           10    0.28    1.000      0.28
# 6  meltMode           10    1.92    6.857      1.81
# 5     melto           10    0.44    1.571      0.44
# 1     table           10    0.83    2.964      0.81

Looks like @eddi's melt with which.max wins.

Data.

DT <- 
  data.table(V1=paste("Day",1:5,sep="_"),V2=c(4,1,4,3,3),V3=c(4,1,5,4,2),V4=c(4,2,4,4,3))
Frank
  • 66,179
  • 8
  • 96
  • 180
3

Convert to long form and then it's trivial to do:

dt <- data.table(id=paste("Day",1:5,sep="_"),V2=c(4,1,4,3,3),V3=c(4,1,5,4,2),V4=c(4,2,4,4,3))

melt(dt, id.vars = 'id')[, .N, by = .(id, value)][, value[which.max(N)], by = id]
#      id V1
#1: Day_1  4
#2: Day_2  1
#3: Day_3  4
#4: Day_4  4
#5: Day_5  3

This is significantly faster than the other options so far, as long as number of unique (id,value) pairs is not too large.

eddi
  • 49,088
  • 6
  • 104
  • 155