
I want to create a random subset of a very large data.table, df (around 2 million rows). The table has a weight column, wgt, that indicates how many observations each row represents. To generate the vector of row numbers I want to extract, I proceed as follows:

I get the exact number of rows:

ns <- length(df$wgt)

I get the number of desired lines (30% of the sample):

lines <- round(0.3 * ns)

I compute the vector of probabilities:

pr <- df$wgt / sum(df$wgt)

And then I compute the vector of line numbers to get the subsample:

ssout <- sample(1:ns, size = lines, prob = pr)

The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
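For reference, here are the four steps run end to end as one minimal sketch, with toy data standing in for the real df (the column name and table size are made up for illustration; note that `sample()`'s weight argument is named `prob`):

```r
# toy stand-in for the real ~2-million-row table
set.seed(42)
df <- data.frame(wgt = sample(10, 1000, replace = TRUE))

ns    <- length(df$wgt)        # number of rows
lines <- round(0.3 * ns)       # 30% of the rows
pr    <- df$wgt / sum(df$wgt)  # per-row sampling probabilities

# weighted sampling without replacement -- this is the step
# that becomes very slow when ns is in the millions
ssout <- sample(1:ns, size = lines, prob = pr)
sub   <- df[ssout, , drop = FALSE]
```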

Is there a faster/more efficient way to do this?

Thank you!

Doon_Bogan

  • Using `sample.int` will trim a little bit off if you specify all the arguments, which will also force you to *not* create the `1:ns` vector in the first place (as @DavidArenburg suggested by skipping the `1:` part) – Gavin Simpson Jul 20 '15 at 17:39
  • Judging by your description ("wgt that indicates how many observation each line represents"), you should be sampling with replacement. If one line has a weight of ten percent, you should be able to draw it multiple times. – Frank Jul 20 '15 at 17:41
  • I guess this doesn't really have anything to do with data.table (which it's tagged with); I'm not sure though... – Frank Jul 20 '15 at 17:55
  • If you decide that you **do** want to sample without replacement, see http://stackoverflow.com/questions/15113650/faster-weighted-sampling-without-replacement (an amazing set of answers there!) – Ben Bolker Jul 20 '15 at 20:07

1 Answer


I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):

# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr  <- wgt/sum(wgt)

# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
#    user  system elapsed 
#    0.20    0.02    0.22

Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
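To finish the job, the sampled index vector plugs straight into a data.table subset. A small self-contained sketch (requires the data.table package; the `id` column and table size here are made up for illustration):

```r
library(data.table)

set.seed(1)
df <- data.table(id = 1:1e4, wgt = sample(10, 1e4, replace = TRUE))

nobs <- sum(df$wgt)
pr   <- df$wgt / nobs

# draw 30% of the weighted population, with replacement
x   <- sample.int(nrow(df), size = round(.3 * nobs), prob = pr, replace = TRUE)
sub <- df[x]   # data.table indexes rows without the trailing comma
```

Because the draw is with replacement, rows with a large wgt appear multiple times in sub, which is exactly what the weights say should happen.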

Frank
  • +1; an example that shows why sampling without replacement is wrong is a scenario where all the weights are 0 except for one (or all are equal to 1, and one is ridiculously large). – eddi Jul 20 '15 at 18:43
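A quick numeric illustration of that comment (the weights are made up): give one row nearly all of the weight. Without replacement that row can appear at most once, no matter its weight; with replacement it dominates the sample, as the weights intend.

```r
set.seed(7)
wgt <- c(rep(1, 9), 1e6)   # row 10 represents ~99.999% of the population
pr  <- wgt / sum(wgt)

# with replacement: row 10 is drawn in nearly every one of 1000 draws
x_wr <- sample.int(10, size = 1000, prob = pr, replace = TRUE)
mean(x_wr == 10)           # close to 1

# without replacement: row 10 can be drawn at most once
x_wo <- sample.int(10, size = 10, prob = pr, replace = FALSE)
sum(x_wo == 10)            # exactly 1
```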