2

How can I speed up probability-weighted Sampling in R.

# Let's assume we are considering following example:
w <- sample(1:4000,size=2e6, replace=T)   

# "w" will be integer, so we are going to convert it to numeric.
w <- as.numeric(w)

# Actually the sampling process have to be repeated many times.
M <- matrix(NA, 10, 2000)
system.time(
for (r in 1:10){
  ix <- sample(1:2e6,size=2000,prob=w/sum(w))
  M[r,] <- ix
})
# It's worth it to mention that without "prob=w/sum(w)" sampling is considerably faster.
# The main goal is to speed up sampling with probability weights!
system.time(ix <- sample(1:2e6,size=2000,prob=w/sum(w)))

A weighted sampling takes 9.84 sec., normal sampling 0.01 sec. If you have any idea how it is possible to speed up weighted sampling, please feel open to answer.

And_R
  • 1,647
  • 3
  • 18
  • 32

1 Answers1

6

The speed problem is only limited to weighted sampling without replacement. Here's your code again, moving the parts unrelated to sample outside of the loop.

normalized_weights <- w/sum(w)
#No weights
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000)
})
#Weighted, no replacement
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000, prob = normalized_weights)
})
#Weighted with replacement
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000, replace = TRUE, prob = normalized_weights)
})

The big problem is that when you do weighted sampling without replacement, each time you pick a value, the weights need to be recalculated. See ?sample:

If 'replace' is false, these probabilities are applied sequentially, that is the probability of choosing the next item is proportional to the weights amongst the remaining items.

There may be faster solutions than using sample (I don't know how well it's been optimized) but it's a fundamentally more computationally intensive task than the unweighted/weighted-with-replacement sampling.

Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
  • Weights don't need to be normalized for `sample`. – Roland Jan 21 '14 at 12:03
  • @Roland True, `sample` will do the normalising itself. (Indeed it does it lots of times for weighted sampling without replacement.) I was trying to let sample do the least work possible for the purposese of testing the speed. – Richie Cotton Jan 21 '14 at 13:36