
**Setup** For the purposes of my simulation, I'm generating a list of B=2000 elements. Each element is the output of a permutation procedure in which I permute the rows of a 200x8000 matrix and then, for each column, calculate the Kolmogorov-Smirnov test statistic between the first 100 and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).

**Question** This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.

**Code**

B = 2000
n.row = 200; n.col = 8000

# Generate sample data
samp.dat = matrix(rnorm(n.row * n.col), nrow = n.row)

# ks.test.stat() is defined below; source it before running this loop
perm.KS.list = NULL
for (b in 1:B){
    # permute the rows
    perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
    # compute the permutation-based test statistic for each column
    perm.KS.list[[b]] = apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}


# Modified KS-test statistic function (adapted from the base stats package)
ks.test.stat <- function(x, y){
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  # running difference between the two empirical CDFs, evaluated in sorted order
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]  # exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
stats134711

1 Answer


The 1:B loop has several places to optimize, but I agree that the real time consumer is that inner function. Because you're simulating well-behaved permutation samples, you can make two simplifying assumptions that the general base function can't:

  1. There aren't missing values, which obviates the is.na() adjustments.
  2. The two sides (i.e., x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep the column together. The balanced sides also permit simplifications like removing the ifelse() clause: the comparison produces a bunch of 0/1s, which are rescaled to -1/+1s with integer arithmetic.

The reduced function below saves about 25% of the time. I also use integers, instead of doubles, inside cumsum().

# KS statistic for two equal-sized groups; pass the whole permuted column as w
ks.test.stat.balanced <- function(w){
  n     <- as.integer(length(w) * .5)
  # z   <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z     <- cumsum((order(w) <= n)*2L - 1L) / n
  # z   <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
  return( max(abs(z)) )
}
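
For reference, here is how the balanced function could slot into the loop from the question (using the B, samp.dat, and perm.dat.tmp objects defined there); preallocating the list is a minor additional saving:

# Illustrative usage with the objects from the question; each whole column is
# passed as w, so the 1:100 / 101:200 split happens implicitly by rank.
perm.KS.list <- vector("list", B)   # preallocate instead of growing from NULL
for (b in 1:B){
  perm.dat.tmp      <- samp.dat[sample(nrow(samp.dat)), ]
  perm.KS.list[[b]] <- apply(perm.dat.tmp, 2, ks.test.stat.balanced)
}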

Ties shouldn't occur often with your Gaussian RNG, and the diff(sort(.)) step is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.

If you move the expression for z inside abs(), it saves a little time over all those reps. I kept it separate above so it's easier to read.
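
A minimal one-expression variant (the name is my own) that also pulls the division by n outside cumsum(), so it happens once per column instead of once per element:

# Same statistic, with z folded into abs() and the division moved outside
ks.test.stat.balanced.inline <- function(w){
  n <- as.integer(length(w) * .5)
  max(abs(cumsum((order(w) <= n) * 2L - 1L))) / n
}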

Edit: in case of an unbalanced simulation, I'd recommend you:

  1. still leave out the is.na() checks,
  2. still pass the combined vector w,
  3. still keep as much as possible in integer, not numeric, arithmetic, but
  4. now include arguments n1 & n2 for the two group sizes.

Also, experiment with precalculating the reciprocals of the group sizes before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to pull calculations out of an inner loop so they occur less frequently; a sketch along those lines follows.
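
A minimal sketch of such a function, assuming the two groups arrive already concatenated in w (group 1 first); the function name and the n1/n2 arguments are hypothetical, not from base R:

# Sketch of an unbalanced version per the points above (names are my own).
# Ties are again left unguarded, as in the balanced version.
ks.test.stat.unbalanced <- function(w, n1, n2){
  inv1 <- 1 / n1                  # precompute the reciprocals once,
  inv2 <- 1 / n2                  # so cumsum() sees no divisions
  z    <- cumsum(ifelse(order(w) <= n1, inv1, -inv2))
  return( max(abs(z)) )
}

# e.g., with 120 rows in group 1 followed by 80 in group 2:
# apply(perm.dat.tmp, 2, ks.test.stat.unbalanced, n1 = 120, n2 = 80)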

wibeasley
  • Thanks! I do want to have some flexibility in allowing `x` and `y` to have differing lengths, but for the purposes of the simulation, I think it should be OK to have them equal. Would I keep my original code for the case of different lengths? – stats134711 Jul 06 '17 at 02:13
  • @stats134711, I added suggestions for unbalanced code in the answer. – wibeasley Jul 06 '17 at 03:01
  • Thanks, that makes sense. It should be pretty easy to weight based on `n1` and `n2`. – stats134711 Jul 06 '17 at 03:02