Selecting frequent values from multiple columns from R table

Question

I have a data.table res which has data as follows:

            V1 V2 V3 V4
  1:     Day_1  4  4  4
  2:     Day_2  1  1  2
  3:     Day_3  4  5  4
  4:     Day_4  3  4  4
  5:     Day_5  3  2  3

I need to select the most frequent value from the columns V2, V3 and V4 combined. That is, I need to select result as follows:

Day_1 4
Day_2 1
Day_3 4
Day_4 4
Day_5 3

I'm not expecting any ties since the there will be always odd number of columns. Is it possible to manipulate the data.table to do this? Or should I modify it some other data type?

Thanks - V

[This](http://stackoverflow.com/questions/2547402/standard-library-function-in-r-for-finding-the-mode) is the function you looking for probably. — David Arenburg, Jun 23 '15 at 12:25
Try `apply(res[,-1, with=FALSE], 1, Mode)` `(Mode)`from the link — akrun, Jun 23 '15 at 12:25
the use of Mode (from the link) with the apply statement works perfectly well! Thank you. — visakh, Jun 23 '15 at 12:29
@akrun I would try `DT[, .(res = Mode(unlist(.SD))), by = V1]` instead. — David Arenburg, Jun 23 '15 at 12:31
How about `dt[V2 == V3 | V2 == V4, ans := V2][V3 == V4, ans := V3][]`? — Arun, Jun 23 '15 at 13:28
@Arun Given that there are no ties, I guess `dt[,ans:=V2][V2!=V3,ans:=V3]` is a little simpler. — Frank, Jun 23 '15 at 13:37
@DavidArenburg Oops, you're right; thanks. Should be `DT[,ans:=V2][V3==V4,ans:=V3]` — Frank, Jun 23 '15 at 14:34
@Frank that's ice for 3 columns. but it will get real messed up the columns grow. — David Arenburg, Jun 23 '15 at 14:42
@DavidArenburg Indeed. I was just trying to simplify what Arun suggested (but I failed on the first try, whoops). — Frank, Jun 23 '15 at 14:45
@Frank, I was hoping to distinguish the case where all three are unique, with `NA`. — Arun, Jun 23 '15 at 16:18

score 5 · Accepted Answer · edited May 23 '17 at 11:58

5

I'm posting this as a data.table version of this old question until something better is offered

Mode <- function(x) {
  ux <- unique(x)  
  ux[which.max(tabulate(match(x, ux)))]
}

DT[, .(res = Mode(unlist(.SD))), by = V1]

#       V1 res
# 1: Day_1   4
# 2: Day_2   1
# 3: Day_3   4
# 4: Day_4   4
# 5: Day_5   3

edited May 23 '17 at 11:58

Community

1
1

answered Jun 23 '15 at 12:57

David Arenburg

91,361
17
137
196

@David...thank you. I will wait for a little while longer before accepting the answer. – visakh Jun 23 '15 at 13:26

Frank · Answer 2 · 2015-06-24T02:42:32.750

Here are two variations on David's answer:

# table
DT[,ans:={
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
},.SDcols=-1]

# apply Mode
DT[,ans:=apply(.SD,1,Mode),.SDcols=-1]

I thought I'd take a look at these because splitting DT by row might be slow.

Speed.

Comparison against by-row. Starting out with very few unique values and rows...

n  <- 1e4
nv <- 5
nc <- 3

DT <- do.call(data.table,c(
  list(id=1:n),
  replicate(nc,sample(nv,n,replace=TRUE),simplify=FALSE)
))

require(rbenchmark)

benchmark(
  table     = DT[,.({
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
  }),.SDcols=-1],
  byMode    = DT[,.(Mode(unlist(.SD))), by = id],
  applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
  replications=10
)[1:5]
#        test replications elapsed relative user.self
# 3 applyMode           10    1.66    4.611      1.65
# 2    byMode           10    2.03    5.639      2.02
# 1     table           10    0.36    1.000      0.36

Increasing each of the three parameters separately...

nv <- 1e3 # up from 5

#        test replications elapsed relative user.self
# 3 applyMode           10    1.67    1.000      1.67
# 2    byMode           10    2.05    1.228      2.02
# 1     table           10    4.27    2.557      4.15

n  <- 5e4 # up from 1e4

#        test replications elapsed relative user.self
# 3 applyMode           10    8.67    4.492      8.65
# 2    byMode           10   10.21    5.290     10.22
# 1     table           10    1.93    1.000      1.92

nc <- 100 # up from 3

#        test replications elapsed relative user.self
# 3 applyMode           10    2.59    1.000      2.59
# 2    byMode           10    6.71    2.591      6.69
# 1     table           10   11.69    4.514     11.68

Discussion. (I'm comparing the elapsed columns across the benchmarks.)

nv. While table wins for the small case, it scales poorly with the number of unique values, since its tab object becomes very large. The other two methods are unaffected.
n. Everything scales linearly with the number of rows. (I expected table to scale worse on this dimension, too, but maybe n needs to be even larger for that.)
nc. table scales linearly with columns as well, while the other two do better.

Again, with melt

# back to original values for n, nv, nc
benchmark(
  table     = DT[,.({
    tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
    as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
  }),.SDcols=-1],
  byMode    = DT[,.(Mode(unlist(.SD))), by = id],
  applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
  melt      = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][, 
                value[which.max(N)], by = id],
  melto     = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][
                order(N)][,last(value),by=id],
  meltMode  = melt(DT, id.vars = 'id')[,Mode(value),by=id],
  replications=10
)[1:5]
#        test replications elapsed relative user.self
# 3 applyMode           10    2.42    8.643      2.36
# 2    byMode           10    2.84   10.143      2.81
# 4      melt           10    0.28    1.000      0.28
# 6  meltMode           10    1.92    6.857      1.81
# 5     melto           10    0.44    1.571      0.44
# 1     table           10    0.83    2.964      0.81

Looks like @eddi's melt with which.max wins.

Data.

DT <- 
  data.table(V1=paste("Day",1:5,sep="_"),V2=c(4,1,4,3,3),V3=c(4,1,5,4,2),V4=c(4,2,4,4,3))

eddi · Answer 3 · 2015-06-23T16:16:21.533

Convert to long form and then it's trivial to do:

dt <- data.table(id=paste("Day",1:5,sep="_"),V2=c(4,1,4,3,3),V3=c(4,1,5,4,2),V4=c(4,2,4,4,3))

melt(dt, id.vars = 'id')[, .N, by = .(id, value)][, value[which.max(N)], by = id]
#      id V1
#1: Day_1  4
#2: Day_2  1
#3: Day_3  4
#4: Day_4  4
#5: Day_5  3

This is significantly faster than the other options so far, as long as number of unique (id,value) pairs is not too large.

Selecting frequent values from multiple columns from R table

3 Answers3