Here are two variations on David's answer:
# table
DT[,ans:={
tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
},.SDcols=-1]
# apply Mode
DT[,ans:=apply(.SD,1,Mode),.SDcols=-1]
I thought I'd take a look at these because splitting DT
by row might be slow.
Speed.
Comparison against by-row. Starting out with very few unique values and rows...
n <- 1e4
nv <- 5
nc <- 3
DT <- do.call(data.table,c(
list(id=1:n),
replicate(nc,sample(nv,n,replace=TRUE),simplify=FALSE)
))
require(rbenchmark)
benchmark(
table = DT[,.({
tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
}),.SDcols=-1],
byMode = DT[,.(Mode(unlist(.SD))), by = id],
applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
replications=10
)[1:5]
# test replications elapsed relative user.self
# 3 applyMode 10 1.66 4.611 1.65
# 2 byMode 10 2.03 5.639 2.02
# 1 table 10 0.36 1.000 0.36
Increasing each of the three parameters separately...
nv <- 1e3 # up from 5
# test replications elapsed relative user.self
# 3 applyMode 10 1.67 1.000 1.67
# 2 byMode 10 2.05 1.228 2.02
# 1 table 10 4.27 2.557 4.15
n <- 5e4 # up from 1e4
# test replications elapsed relative user.self
# 3 applyMode 10 8.67 4.492 8.65
# 2 byMode 10 10.21 5.290 10.22
# 1 table 10 1.93 1.000 1.92
nc <- 100 # up from 3
# test replications elapsed relative user.self
# 3 applyMode 10 2.59 1.000 2.59
# 2 byMode 10 6.71 2.591 6.69
# 1 table 10 11.69 4.514 11.68
Discussion. (I'm comparing the elapsed
columns across the benchmarks.)
- nv. While
table
wins for the small case, it scales poorly with the number of unique values, since its tab
object becomes very large. The other two methods are unaffected.
- n. Everything scales linearly with the number of rows. (I expected
table
to scale worse on this dimension, too, but maybe n
needs to be even larger for that.)
- nc.
table
scales linearly with columns as well, while the other two do better.
Again, with melt
# back to original values for n, nv, nc
benchmark(
table = DT[,.({
tab <- table( r = rep(.I,length(.SD)), unlist(.SD) )
as( colnames(tab)[ max.col(tab) ], class(.SD[[1]]) )
}),.SDcols=-1],
byMode = DT[,.(Mode(unlist(.SD))), by = id],
applyMode = DT[,.(apply(.SD,1,Mode)),.SDcols=-1],
melt = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][,
value[which.max(N)], by = id],
melto = melt(DT, id.vars = 'id')[, .N, by = .(id, value)][
order(N)][,last(value),by=id],
meltMode = melt(DT, id.vars = 'id')[,Mode(value),by=id],
replications=10
)[1:5]
# test replications elapsed relative user.self
# 3 applyMode 10 2.42 8.643 2.36
# 2 byMode 10 2.84 10.143 2.81
# 4 melt 10 0.28 1.000 0.28
# 6 meltMode 10 1.92 6.857 1.81
# 5 melto 10 0.44 1.571 0.44
# 1 table 10 0.83 2.964 0.81
Looks like @eddi's melt
with which.max
wins.
Data.
DT <-
data.table(V1=paste("Day",1:5,sep="_"),V2=c(4,1,4,3,3),V3=c(4,1,5,4,2),V4=c(4,2,4,4,3))