I want to create a random subset of a very large data.table `df` (around 2 million lines). The data table has a weight column, `wgt`, that indicates how many observations each line represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the total number of lines:

ns <- length(df$wgt)
I get the desired number of lines (30% of the sample):

lines <- round(0.3 * ns)
I compute the vector of selection probabilities:

pr <- df$wgt / sum(df$wgt)
And then I compute the vector of line numbers to get the subsample (the argument to `sample()` is `prob`, not `probs`):

ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using `df[ssout, ]`. However, R gets stuck when computing `ssout`.
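
For reference, here is the whole procedure as a single runnable snippet. The small simulated `df` is only a stand-in for illustration (the `id` column and the weights are made up); my real table has around 2 million lines:

library(data.table)

# Small simulated stand-in for the real table (illustration only)
set.seed(1)
df <- data.table(id = 1:10000, wgt = runif(10000, 0.5, 5))

ns <- length(df$wgt)        # total number of lines
lines <- round(0.3 * ns)    # desired subsample size (30%)
pr <- df$wgt / sum(df$wgt)  # selection probabilities from the weights

# Weighted sampling without replacement -- the step that stalls at 2M lines
ssout <- sample(1:ns, size = lines, prob = pr)

sub <- df[ssout, ]          # the desired subsample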
Is there a faster/more efficient way to do this?
Thank you!