Another answer, with some notes on efficiency (although this Q&A is not about speed).
Firstly, it can be better to avoid converting a "list"-y structure to a "matrix". Sometimes it is worth converting to a "matrix" and using a function that efficiently handles a 'vector with a "dim" attribute' (i.e. a "matrix"/"array"); other times it is not. Both max.col and apply convert to a "matrix".
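As a small illustration of the "list" versus "matrix" distinction, here is a sketch on a toy data.frame (the object names d and m and the values are made up for this example, not taken from the question). The columns of a "data.frame" can be reached directly as list elements, while as.matrix builds a separate object carrying a "dim" attribute:

```r
# toy data, invented for illustration only
d = data.frame(a = c(1L, 0L), b = c(0L, 1L))

is.list(d)        # a "data.frame" is a "list" of column vectors
d[[2]]            # a column, reached without any conversion

m = as.matrix(d)  # a new object: one vector with a "dim" attribute
is.matrix(m)
dim(m)

# the same values, reached through the "list" and through the "matrix"
identical(d[[2]], unname(m[, 2]))
```

Whether the conversion pays off depends on what is done with the "matrix" afterwards.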
Secondly, in situations like this one, where we do not need to inspect all the data to reach a solution, we can benefit from a loop that controls what goes through to the next iteration. Here we know that we can stop as soon as we have found the first "1". Both max.col and which.max have to go through all the values once to actually find the maximum; neither takes advantage of the fact that we already know "max == 1".
Thirdly, match is potentially slower when we look up only one value in another vector of values, because match's setup is rather complicated and costly:
x = 5; set.seed(199); tab = sample(1e6)
identical(match(x, tab), which.max(x == tab))
#[1] TRUE
microbenchmark::microbenchmark(match(x, tab), which.max(x == tab), times = 25)
#Unit: milliseconds
#                expr       min        lq    median        uq       max neval
#       match(x, tab) 142.22327 142.50103 142.79737 143.19547 145.37669    25
# which.max(x == tab)  18.91427  18.93728  18.96225  19.58932  38.34253    25
To sum up, one way to work on the "list" structure of a "data.frame", and to stop computing once we have found a "1", could be a loop like the following:
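ff = function(x)
{
    x = as.list(x)                   # work on the columns directly, no "matrix" conversion
    ans = as.integer(x[[1]])         # 1 where the first column holds a "1", 0 elsewhere
    for(i in seq_along(x)[-1]) {     # columns 2, 3, ...; nothing to do with a single column
        inds = ans == 0L             # rows that have not met a "1" yet
        if(!any(inds)) return(ans)   # every row is resolved: stop early
        ans[inds] = i * (x[[i]][inds] == 1)   # i where column i holds a "1", else still 0
    }
    return(ans)                      # rows without any "1" stay 0
}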
And the solutions in the other answers (ignoring the extra steps for the output):
david = function(x) max.col(x, "first")
plafort = function(x) apply(x, 1, match, x = 1)
ff(df[-1])
#[1] 1 3 4 1
david(df[-1])
#[1] 1 3 4 1
plafort(df[-1])
#[1] 1 3 4 1
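The df used above is the one from the question and is not reproduced in this answer. To run the calls, a hypothetical stand-in could look like the sketch below; the values and column names are invented here, chosen only so that the first "1" falls in columns 1, 3, 4 and 1:

```r
# hypothetical stand-in for the question's 'df'; values and names are made up
df = data.frame(id  = 1:4,
                in1 = c(1, 0, 0, 1),
                in2 = c(0, 0, 0, 1),
                in3 = c(0, 1, 0, 0),
                in4 = c(0, 1, 1, 0))

res_maxcol = max.col(df[-1], "first")         # the call wrapped by david()
res_match  = apply(df[-1], 1, match, x = 1)   # the call wrapped by plafort()

res_maxcol
res_match
```

The approaches agree here because every row contains a "1". For a row of zeros they would differ: max.col with "first" reports column 1, match gives NA, and ff leaves a 0.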
And some benchmarks:
set.seed(007)
DF = data.frame(id = seq_len(1e6),
                "colnames<-"(matrix(sample(0:1, 1e7, TRUE, c(0.25, 0.75)), 1e6),
                             paste("in", 11:20, sep = "")))
identical(ff(DF[-1]), david(DF[-1]))
#[1] TRUE
identical(ff(DF[-1]), plafort(DF[-1]))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF[-1]), david(DF[-1]), as.matrix(DF[-1]), times = 30)
#Unit: milliseconds
#              expr       min        lq    median        uq       max neval
#        ff(DF[-1])  64.83577  65.45432  67.87486  70.32073  86.72838    30
#     david(DF[-1]) 112.74108 115.12361 120.16118 132.04803 145.45819    30
# as.matrix(DF[-1])  20.87947  22.01819  27.52460  32.60509  45.84561    30
system.time(plafort(DF[-1]))
#   user  system elapsed
#  4.117   0.000   4.125
Nothing dramatic, but it is worth seeing that simple, straightforward algorithmic approaches can indeed prove equally good or even better, depending on the problem. Obviously, (most) other times looping in R can be laborious.