
I have a vector of successes and want to run binom.test on each of the values. Is there a faster method than this loop (I have quite a lot of them)?

successes <- rbinom(100, 625, 1/5)
x <- NULL
for (i in 1:100) {
    x <- append(x, binom.test(successes[i], 625, 1/5)$p.value)
}
biggob1

2 Answers


Instead of a for loop you can use sapply() to calculate the p-value for each value of successes.

pp <- sapply(successes, function(x) binom.test(x, 625, 1/5)$p.value)
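
As a quick sanity check (a sketch, assuming the loop result x from the question is still in the workspace), the two approaches give the same values:

# assumes pp from above and the loop result x from the question
identical(pp, x)   # should be TRUE: same p-values in the same order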

If you need a real speed-up, you can take advantage of the data.table package. First, convert successes to a data.table object; then compute the p-value for each row, grouping by successes so that binom.test runs only once per distinct value.

library(data.table)
dt <- data.table(successes)
dt[, pp := binom.test(successes, 625, 1/5)$p.value, by = successes]
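
To see the result and check it against the sapply() output (a quick sketch, assuming pp and dt from above; := with by = successes fills the new column in the original row order):

head(dt)              # each successes value now carries its p-value in pp
identical(dt$pp, pp)  # should be TRUE
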
Didzis Elferts
  • Will this be significantly faster than the loop? Thanks – biggob1 Mar 18 '13 at 12:02
  • @biggob1 The sapply solution won't be faster, but you can use the data.table solution, which will be much faster - see the updated answer. – Didzis Elferts Mar 18 '13 at 12:11
  • @DidzisElferts I'm assuming the sapply solution will be faster than a loop in this case, but only because the way they programmed the loop was horribly inefficient. – Dason Mar 18 '13 at 18:05

Wow, data.table is really fast and seems to just work! Many of the values in successes are repeated, so one can save time by doing the expensive binom.test calculations on just the unique values.

fasterbinom <- function(x, ...) {
    u <- unique(x)       # the distinct success counts
    idx <- match(x, u)   # position of each element of x within u
    ## run the expensive binom.test once per unique value, then index back out
    sapply(u, function(elt, ...) binom.test(elt, ...)$p.value, ...)[idx]
}
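
The trick is that match(x, u) gives, for each element of x, the position of its value in unique(x), so the per-unique-value p-values can simply be indexed back out in the right order. A tiny illustration with made-up numbers (not from the original answer):

# hypothetical small example of the unique()/match() indexing
v   <- c(5, 7, 5, 9, 7)
u   <- unique(v)    # 5 7 9
idx <- match(v, u)  # 1 2 1 3 2
u[idx]              # 5 7 5 9 7 -- reconstructs v from the unique values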

For some timings, we have

dtbinom <- function(x, ...) {
    dt <- data.table(x)
    ## by = x groups identical values, so binom.test runs once per unique x
    dt[, pp := binom.test(x, ...)$p.value, by = x]$pp
}

with

> successes <- rbinom(100000, 625, 1/5)
> identical(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
[1] TRUE
> library(rbenchmark)
> benchmark(fasterbinom(successes, 625, .2), dtbinom(successes, 625, .2))
                              test replications elapsed relative user.self
2     dtbinom(successes, 625, 0.2)          100   4.265    1.019     4.252
1 fasterbinom(successes, 625, 0.2)          100   4.184    1.000     4.124
  sys.self user.child sys.child
2    0.008          0         0
1    0.052          0         0

It's interesting in this case to compare the looping approaches

f0 <- function(s, ...) {
    x0 <- NULL
    for (i in seq_along(s))
        x0 <- append(x0, binom.test(s[i], ...)$p.value)
    x0
}

f1 <- function(s, ...) {
    x1 <- numeric(length(s))
    for (i in seq_along(s))
        x1[i] <- binom.test(s[i], ...)$p.value
    x1
}

f2 <- function(s, ...)
    sapply(s, function(x, ...) binom.test(x, ...)$p.value, ...)

f3 <- function(s, ...)
    vapply(s, function(x, ...) binom.test(x, ...)$p.value, numeric(1), ...)

where f1 uses the generally better 'pre-allocate and fill' strategy for a for loop (growing a vector with append() copies it on every iteration), f2 is an sapply() that takes the possibility of a poorly formulated for loop out of the user's hands, and f3 is a safer and potentially faster version of sapply() that guarantees each result is a length-1 numeric value.

Each function returns the same result

> n <- 1000
> xx <- rbinom(n, 625, 1/5)
> res0 <- f0(xx, 625, .2)
> identical(res0, f1(xx, 625, .2))
[1] TRUE
> identical(res0, f2(xx, 625, .2))
[1] TRUE
> identical(res0, f3(xx, 625, .2))
[1] TRUE

and while the apply-like methods are only about 10% faster than the for loops in this case (the difference between f0 and f1 can be much more dramatic when the individual elements are large)

> benchmark(f0(xx, 625, .2), f1(xx, 625, .2), f2(xx, 625, .2),
+           f3(xx, 625, .2), replications=5)
              test replications elapsed relative user.self sys.self user.child
1 f0(xx, 625, 0.2)            5   2.303    1.100     2.300        0          0
2 f1(xx, 625, 0.2)            5   2.361    1.128     2.356        0          0
3 f2(xx, 625, 0.2)            5   2.093    1.000     2.088        0          0
4 f3(xx, 625, 0.2)            5   2.212    1.057     2.208        0          0
  sys.child
1         0
2         0
3         0
4         0

the real speed-up comes from the fancier algorithm of fasterbinom / dtbinom.

> identical(res0, fasterbinom(xx, 625, .2))
[1] TRUE
> benchmark(f2(xx, 625, .2), fasterbinom(xx, 625, .2), replications=5)
                       test replications elapsed relative user.self sys.self
1          f2(xx, 625, 0.2)            5   2.146   16.258     2.145        0
2 fasterbinom(xx, 625, 0.2)            5   0.132    1.000     0.132        0
  user.child sys.child
1          0         0
2          0         0
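
The same unique()/match() idea generalizes to any expensive function that gets called element-wise on a vector with many repeated values. A hedged sketch (the helper name and exact form are mine, not from the original answer), assuming FUN returns a length-1 numeric:

# hypothetical generalization of fasterbinom(): compute FUN once per unique value
memo_vapply <- function(x, FUN, ...) {
    u   <- unique(x)
    idx <- match(x, u)
    vapply(u, FUN, numeric(1), ...)[idx]
}

# e.g., this should reproduce fasterbinom(successes, 625, 1/5):
# memo_vapply(successes, function(s, ...) binom.test(s, ...)$p.value, 625, 1/5)
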
Martin Morgan
  • Hi. I didn't quite understand why you said wow near the top; aren't the times for `dtbinom` and `fasterbinom` about the same at around 4.1 seconds? That's 4.1 seconds for `replications=100`, so each task took 0.04 seconds. `data.table` isn't really for small tasks, so I'm a little confused here. There's overhead of `[.data.table` and the `data.table()` call inside `dtbinom` would normally be things that would degrade performance when repeated in a loop on small tasks. – Matt Dowle Mar 20 '13 at 00:09
  • @MatthewDowle I was being impressed -- data.table, like fasterbinom, is 1100x faster than, say, the naive f2 for this sample size, and they both scale really well with size (not shown). It took me some thought to figure out how to make fasterbinom work faster, whereas data.table was fast out of the box. Obviously you're doing the right thing in a way that's intuitive to your users... – Martin Morgan Mar 20 '13 at 00:32
  • Ah, I see now. I missed that both were faster than the naive. Thanks. – Matt Dowle Mar 20 '13 at 10:34