The bottleneck is from the sampling, as just mentioned by Jan van der Laan.
A solution when you need to sample without replacement (and when the size is at least 5 times less than the initial size) is sampling with rejection.
You could sample with replacement twice the number you need and take only the number of first unique values.
N <- 23e6
dat <- data.frame(
target = sample(0:1, size = N, replace = TRUE),
x = rnorm(N)
)
prb <- ifelse(dat$target == 1, 1.0, 0.05)
n <- 2e6
Rcpp::sourceCpp('sample-fast.cpp')
sample_fast <- function(n, prb) {
N <- length(prb)
sample_more <- sample.int(N, size = 2 * n, prob = prb, replace = TRUE)
get_first_unique(sample_more, N, n)
}
where 'sample-fast.cpp' contains
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector get_first_unique(const IntegerVector& ind_sample, int N, int n) {
LogicalVector is_chosen(N);
IntegerVector ind_chosen(n);
int i, k, ind;
for (k = 0, i = 0; i < n; i++) {
do {
ind = ind_sample[k++];
} while (is_chosen[ind-1]);
is_chosen[ind-1] = true;
ind_chosen[i] = ind;
}
return ind_chosen;
}
Then you get:
system.time(ind <- sample_fast(n, prb))
in less than 1 second.