Wow, data.table is really fast and seems to just work! Many of the values in successes are repeated, so one can save time by doing the expensive binom.test calculations just on the unique values.
fasterbinom <- function(x, ...) {
    u <- unique(x)
    idx <- match(x, u)
    sapply(u, function(elt, ...) binom.test(elt, ...)$p.value, ...)[idx]
}
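To see why indexing with match puts the results back in the original order, here is a minimal sketch on a small hand-made vector. The input x and the stand-in function slow_square are made up for illustration; they are not part of the timings in this answer.

```r
# Illustrative input with repeated values (made up for this sketch)
x <- c(3, 1, 3, 2, 1)

u <- unique(x)        # distinct values, in order of first appearance
idx <- match(x, u)    # position of each element of x within u

# Hypothetical stand-in for an expensive scalar computation
slow_square <- function(v) v^2

# Compute once per distinct value, then expand back to the full length
res <- vapply(u, slow_square, numeric(1))[idx]

# u[idx] reconstructs x, so res lines up element by element with x
stopifnot(identical(u[idx], x))
stopifnot(identical(res, x^2))
```

The expensive function runs three times here instead of five; the saving grows with the number of repeats.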
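For some timings against a data.table version, we have
library(data.table)
dtbinom <- function(x, ...) {
    dt <- data.table(x)
    # by = x runs binom.test once per distinct value of x
    dt[, pp := binom.test(x, ...)$p.value, by = x]$pp
}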
with
> successes <- rbinom(100000, 625, 1/5)
> identical(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
[1] TRUE
> library(rbenchmark)
> benchmark(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
                              test replications elapsed relative user.self sys.self user.child sys.child
2     dtbinom(successes, 625, 0.2)          100   4.265    1.019     4.252    0.008          0         0
1 fasterbinom(successes, 625, 0.2)          100   4.184    1.000     4.124    0.052          0         0
It's interesting in this case to compare the looping approaches:
f0 <- function(s, ...) {
    x0 <- NULL
    for (i in seq_along(s))
        x0 <- append(x0, binom.test(s[i], ...)$p.value)
    x0
}
f1 <- function(s, ...) {
    x1 <- numeric(length(s))
    for (i in seq_along(s))
        x1[i] <- binom.test(s[i], ...)$p.value
    x1
}
f2 <- function(s, ...)
    sapply(s, function(x, ...) binom.test(x, ...)$p.value, ...)
f3 <- function(s, ...)
    vapply(s, function(x, ...) binom.test(x, ...)$p.value, numeric(1), ...)
where f0 grows the result one element at a time with append, f1 is a generally better 'pre-allocate and fill' strategy when using for, f2 is an sapply that removes the possibility of a poorly formulated for loop from the user's grasp, and f3 is a safer and potentially faster version of sapply that ensures each result is a length-1 numeric value.
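As a small sketch of what 'safer' means for vapply, the base R snippet below compares the two on edge cases. The helper pval is a name introduced here for illustration only.

```r
# Helper introduced for this sketch: p-value for one success count
pval <- function(x) binom.test(x, 625, 0.2)$p.value

# On zero-length input, sapply falls back to an empty list ...
empty_s <- sapply(integer(0), pval)
# ... while vapply still returns the declared type
empty_v <- vapply(integer(0), pval, numeric(1))

stopifnot(is.list(empty_s), is.numeric(empty_v), length(empty_v) == 0)

# vapply also fails loudly if a result is not a length-1 numeric
bad <- tryCatch(
    vapply(1:3, function(x) c(x, x), numeric(1)),
    error = function(e) "error"
)
stopifnot(identical(bad, "error"))
```

With sapply, the same length-2 results would be silently simplified to a matrix, so downstream code could break far from the actual cause.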
Each function returns the same result:
> n <- 1000
> xx <- rbinom(n, 625, 1/5)
> res0 <- f0(xx, 625, .2)
> identical(res0, f1(xx, 625, .2))
[1] TRUE
> identical(res0, f2(xx, 625, .2))
[1] TRUE
> identical(res0, f3(xx, 625, .2))
[1] TRUE
and while the apply-like methods are about 10% faster than the for loops (in this case; the difference between f0 and f1 can be much more dramatic when the individual elements are large)
> benchmark(f0(xx, 625, .2), f1(xx, 625, .2), f2(xx, 625, .2),
+ f3(xx, 625, .2), replications=5)
              test replications elapsed relative user.self sys.self user.child sys.child
1 f0(xx, 625, 0.2)            5   2.303    1.100     2.300        0          0         0
2 f1(xx, 625, 0.2)            5   2.361    1.128     2.356        0          0         0
3 f2(xx, 625, 0.2)            5   2.093    1.000     2.088        0          0         0
4 f3(xx, 625, 0.2)            5   2.212    1.057     2.208        0          0         0
the real speed-up comes from the fancier algorithm of fasterbinom / dtbinom.
> identical(res0, fasterbinom(xx, 625, .2))
[1] TRUE
> benchmark(f2(xx, 625, .2), fasterbinom(xx, 625, .2), replications=5)
                       test replications elapsed relative user.self sys.self user.child sys.child
1          f2(xx, 625, 0.2)            5   2.146   16.258     2.145        0          0         0
2 fasterbinom(xx, 625, 0.2)            5   0.132    1.000     0.132        0          0         0
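The size of the speed-up tracks how many distinct values there are. With size 625 and probability 0.2 the standard deviation is sqrt(625 * 0.2 * 0.8) = 10, so the draws cluster around the mean of 125 and a sample of 1000 contains far fewer distinct values than elements. A quick way to check this is sketched below; the seed is arbitrary and chosen here only so the sketch is reproducible.

```r
set.seed(123)  # arbitrary seed, an assumption of this sketch
xx <- rbinom(1000, 625, 1/5)

n_all <- length(xx)              # binom.test calls made by f2
n_unique <- length(unique(xx))   # binom.test calls made by fasterbinom

# Ratio of calls; the measured speed-up should be of similar order,
# less the (small) cost of unique() and match()
n_all / n_unique
```

The same reasoning suggests the gain would shrink for data with few repeats, since unique then removes little work.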