7

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:

for (i in 1:99000) {
  subset.data <- data[index[, i], ]
}

Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset the original data matrix to draw resamples of the data.

# The basic setup

B <- 1000 # no. of bootstrap replications
n <- 250  # no. of observations
m <- 100  # no. of models/data series

# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).

boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)

# Make matrix with m data series of length n.

sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)

subsetMatrix <- function(data, index) { # fn definition for timing
  subset.data <- data[index, ]
  return(subset.data)
}

# check how long it takes.

Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) { 
  for (b in 1:B) {  # B * (m - 1) = 1000 * 99 = 99000
    boot.data <- subsetMatrix(sample.data, boot.index[, b])
    # do some other stuff
  }
  # do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")

# > summaryRprof("subsetMatrix.out")
# $by.self
#              self.time self.pct total.time total.pct
# subsetMatrix      9.96      100       9.96       100

# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
#              self.time self.pct total.time total.pct
# subsetMatrix       6.78    53.98       6.78     53.98
# colMeans           1.98    15.76       2.20     17.52
# makeIndex          1.08     8.60       2.12     16.88
# makeStats          0.66     5.25       9.66     76.91
# runif              0.60     4.78       0.72      5.73
# apply              0.30     2.39       0.42      3.34
# is.data.frame      0.22     1.75       0.22      1.75
# ceiling            0.18     1.43       0.18      1.43
# aperm.default      0.14     1.11       0.14      1.11
# array              0.12     0.96       0.12      0.96
# estimateMCS        0.10     0.80      12.56    100.00
# as.vector          0.10     0.80       0.10      0.80
# matrix             0.08     0.64       0.08      0.64
# lapply             0.06     0.48       0.06      0.48
# /                  0.04     0.32       0.04      0.32
# :                  0.04     0.32       0.04      0.32
# rowSums            0.04     0.32       0.04      0.32
# -                  0.02     0.16       0.02      0.16
# >                  0.02     0.16       0.02      0.16
#
# $by.total
#              total.time total.pct self.time self.pct
# estimateMCS        12.56    100.00      0.10     0.80
# makeStats           9.66     76.91      0.66     5.25
# subsetMatrix        6.78     53.98      6.78    53.98
# colMeans            2.20     17.52      1.98    15.76
# makeIndex           2.12     16.88      1.08     8.60
# runif               0.72      5.73      0.60     4.78
# doTest              0.68      5.41      0.00     0.00
# apply               0.42      3.34      0.30     2.39
# aperm               0.26      2.07      0.00     0.00
# is.data.frame       0.22      1.75      0.22     1.75
# sweep               0.20      1.59      0.00     0.00
# ceiling             0.18      1.43      0.18     1.43
# aperm.default       0.14      1.11      0.14     1.11
# array               0.12      0.96      0.12     0.96
# as.vector           0.10      0.80      0.10     0.80
# matrix              0.08      0.64      0.08     0.64
# lapply              0.06      0.48      0.06     0.48
# unlist              0.06      0.48      0.00     0.00
# /                   0.04      0.32      0.04     0.32
# :                   0.04      0.32      0.04     0.32
# rowSums             0.04      0.32      0.04     0.32
# -                   0.02      0.16      0.02     0.16
# >                   0.02      0.16      0.02     0.16
# mean                0.02      0.16      0.00     0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56'

Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but still not very pleasing :/

  • Is there a better way to resample the data / get rid of the loop?
  • Can apply, Vectorize, replicate etc. come in anywhere?
  • Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?

Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!

related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix

from there

mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))

didn't really do it for me.

Community
  • 1
  • 1
Niels
  • 81
  • 5
  • 2
    `[` is extremely fast and unlikely to be the problem. Your first `summaryRprof` is a bit useless since the only thing you are doing there is use `subsetMatrix`. Your second `summaryRprof` might be revealing in showing that other operations like `lookupMatrix` or `colMeans` are taking much more time than `subsetMatrix` but you are not showing us enough of your code or profiles. That your code is generally slow is in my opinion the result of that double `for` loop. If your algorithm can be vectorized, we can help you. But we need to see your code and a reproducible example. – flodel Dec 08 '13 at 19:54
  • Thanks for your comments. @DWin, running the whole code works for me. – Niels Dec 08 '13 at 22:34
  • @flodel, I uploaded the code at [github](https://github.com/nm4k4/MCS/blob/master/MCS_bootstrap.R), but I didn't want to complicate things. Instead of the first `Rprof` a `system.time` would have done it. I only defined the whole `subsetMatrix` (same as `lookupMatrix`) function to measure the time it takes in the overall application. `[` involves making space in memory(?). Would it be faster to simply manipulate pointers in C? – Niels Dec 08 '13 at 22:44

1 Answers1

3

I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:

makeStats <- function(data, index) {

  data.mean <- colMeans(data)
  m <- nrow(data)
  n <- ncol(index)
  tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m))
  weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
  boot.data.mean <- t(data) %*% weights - data.mean

  return(list(data.mean = data.mean,
              boot.data.mean = boot.data.mean))
}

makeIndex <- function(B, blocks){

  n <- ncol(blocks)
  l <- nrow(blocks)
  z <- ceiling(n/l)
  start.points <- sample.int(n, z * B, replace = TRUE)
  index <- blocks[, start.points]
  keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
  boot.index <- matrix(as.vector(index)[keep],
                       nrow = n, ncol = B)

  return(boot.index)
}

This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)

flodel
  • 87,577
  • 21
  • 185
  • 223
  • This is awesome. Thank you so much!! I'll take it as an exercise to go through the rest of the code then. – Niels Dec 09 '13 at 09:41
  • No problem. Here is another one I noticed but was too lazy to try because you used it a lot: `sweep` is in general slower than using a double transpose: e.g. `sweep(x, 2, y, FUN = "-")` versus `t(t(x) - y)`. That's only recommended if `x` is a matrix, not data.frame. – flodel Dec 09 '13 at 12:54