
I want to create a random subset of a very large data.table, df (around 2 million rows). The table has a weight column, wgt, that indicates how many observations each row represents. To generate the vector of row numbers I want to extract, I proceed as follows:

I get the exact number of rows:

ns <- length(df$wgt)

I get the number of desired lines (30% of the sample):

lines <- round(0.3 * ns)

I compute the vector of probabilities:

pr <- df$wgt / sum(df$wgt)

And then I compute the vector of line numbers to get the subsample:

ssout <- sample(1:ns, size = lines, prob = pr)

The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
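For reference, here are the four steps run end to end as one minimal sketch, with toy data standing in for the real df (the column name and table size are made up for illustration; note that `sample()`'s weight argument is named `prob`):

```r
# toy stand-in for the real ~2-million-row table
set.seed(42)
df <- data.frame(wgt = sample(10, 1000, replace = TRUE))

ns    <- length(df$wgt)        # number of rows
lines <- round(0.3 * ns)       # 30% of the rows
pr    <- df$wgt / sum(df$wgt)  # per-row sampling probabilities

# weighted sampling without replacement -- this is the step
# that becomes very slow when ns is in the millions
ssout <- sample(1:ns, size = lines, prob = pr)
sub   <- df[ssout, , drop = FALSE]
```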

Is there a faster/more efficient way to do this?

Thank you!

Doon_Bogan

  • Using `sample.int` will trim a little bit off if you specify all the arguments, which will also force you to *not* create the `1:ns` vector in the first place (as @DavidArenburg suggested by skipping the `1:` part) – Gavin Simpson Jul 20 '15 at 17:39
  • Judging by your description ("wgt that indicates how many observation each line represents"), you should be sampling with replacement. If one line has a weight of ten percent, you should be able to draw it multiple times. – Frank Jul 20 '15 at 17:41
  • I guess this doesn't really have anything to do with data.table (which it's tagged with); I'm not sure though... – Frank Jul 20 '15 at 17:55
  • If you decide that you **do** want to sample without replacement, see http://stackoverflow.com/questions/15113650/faster-weighted-sampling-without-replacement (an amazing set of answers there!) – Ben Bolker Jul 20 '15 at 20:07

1 Answer


I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):

# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr  <- wgt/sum(wgt)

# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
#    user  system elapsed 
#    0.20    0.02    0.22

Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
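To finish the job, the sampled index vector plugs straight into a data.table subset. A small self-contained sketch (requires the data.table package; the `id` column and table size here are made up for illustration):

```r
library(data.table)

set.seed(1)
df <- data.table(id = 1:1e4, wgt = sample(10, 1e4, replace = TRUE))

nobs <- sum(df$wgt)
pr   <- df$wgt / nobs

# draw 30% of the weighted population, with replacement
x   <- sample.int(nrow(df), size = round(.3 * nobs), prob = pr, replace = TRUE)
sub <- df[x]   # data.table indexes rows without the trailing comma
```

Because the draw is with replacement, rows with a large wgt appear multiple times in sub, which is exactly what the weights say should happen.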

Frank
  • +1; an example that shows why sampling without replacement is wrong is a scenario where all the weights are 0 except for one (or all are equal to 1, and one is ridiculously large). – eddi Jul 20 '15 at 18:43
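A quick numeric illustration of that comment (the weights are made up): give one row nearly all of the weight. Without replacement that row can appear at most once, no matter its weight; with replacement it dominates the sample, as the weights intend.

```r
set.seed(7)
wgt <- c(rep(1, 9), 1e6)   # row 10 represents ~99.999% of the population
pr  <- wgt / sum(wgt)

# with replacement: row 10 is drawn in nearly every one of 1000 draws
x_wr <- sample.int(10, size = 1000, prob = pr, replace = TRUE)
mean(x_wr == 10)           # close to 1

# without replacement: row 10 can be drawn at most once
x_wo <- sample.int(10, size = 10, prob = pr, replace = FALSE)
sum(x_wo == 10)            # exactly 1
```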