
I need to extract 2 million observations from a data set of 23 million. With the code below this takes a very long time: on a Xeon CPU with 16GB RAM it is still running after 12 hours. I also noticed that the CPU is running at only 25% and the hard disk at 43%. How can I make the sampling run faster? Here are the two lines of code I'm using:

prb <- ifelse(dat$target=='1', 1.0, 0.05)
smpl <- dat[sample(nrow(dat), 2000000, prob = prb), ]
pogibas
mql4beginner

3 Answers

2

The sample function, when called with unequal probabilities and replace = FALSE, probably doesn't do exactly what you want: it draws one element, rescales the remaining probabilities so that they add up to one, draws the next element, and so on. This makes it slow, and the resulting inclusion probabilities no longer match the original ones.
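The mismatch between the weights and the actual inclusion probabilities can be checked with a small simulation. This is only a sketch on toy data: the weights `w`, the sample size `n`, and the number of repetitions `reps` are made-up values, not the asker's data.

```r
set.seed(1)

w <- c(rep(1, 5), rep(0.05, 5))   # toy weights: 5 heavy units, 5 light units
n <- 5                            # sample size
reps <- 2000                      # number of simulated samples

# Inclusion probabilities one might expect: proportional to the weights
nominal <- n * w / sum(w)

# Empirical inclusion frequencies under sample.int(..., replace = FALSE, prob = w)
counts <- integer(length(w))
for (r in seq_len(reps)) {
  idx <- sample.int(length(w), n, replace = FALSE, prob = w)
  counts[idx] <- counts[idx] + 1L
}
empirical <- counts / reps

# Compare the two rows: the heavy units are included less often than `nominal`
round(rbind(nominal, empirical), 3)
```

In this toy setup the heavy units should come out below their nominal inclusion probability and the light units above it, which is the effect described above.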

One solution in your case would be to split your data set in two (target == '1' and target != '1') and draw a separate sample from each part. You would only have to work out how many elements you want to select from each group.
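A minimal sketch of the split-and-sample idea, on toy data standing in for `dat`. How to divide the draws between the two groups is an assumption here: the sketch splits them in proportion to each group's total weight, but any other allocation (for example a fixed 50% of target == '1') can be substituted for `n_one`.

```r
set.seed(42)

# Toy data standing in for `dat` (sizes are made up for illustration)
N <- 1e5
dat <- data.frame(target = sample(c("0", "1"), N, replace = TRUE), x = rnorm(N))
n <- 1e4

is_one <- dat$target == "1"
w <- ifelse(is_one, 1.0, 0.05)

# Assumption: allocate the n draws in proportion to each group's total weight
n_one <- round(n * sum(w[is_one]) / sum(w))
n_other <- n - n_one
stopifnot(n_one <= sum(is_one), n_other <= sum(!is_one))

# Equal-probability sampling without replacement within each group
idx <- c(
  sample(which(is_one), n_one),
  sample(which(!is_one), n_other)
)
smpl <- dat[idx, ]
table(smpl$target)
```

Within each group every row has the same probability, so no `prob` argument is needed and the slow unequal-probability code path is avoided.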

Another solution is to use the sampling methods from the sampling package. For example, systematic sampling:

library(sampling)

nsample <- 2E6

# Scale probabilities: add up to the number of elements we want
prb <- nsample/sum(prb) * prb

# Sample
smpl <- UPrandomsystematic(prb)

This takes approx 3 seconds on my system.

Checking the output:

> t <- table(smpl, prb)
> sum(smpl)
[1] 2e+06
> t[2,2]/t[2,1]
[1] 19.96854

We have indeed selected 2E6 records, and the inclusion probability for target != 1 is 20 times smaller than for target == 1.
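Judging by the sum(smpl) output above, smpl is a 0/1 indicator with one entry per row, so the selected rows of the data set would be dat[smpl == 1, ]. To make the mechanics concrete without the package, here is a base R sketch of randomized systematic sampling. It is an illustration of the general technique under stated assumptions, not the sampling package's implementation; `systematic_sample`, `pik`, and the toy `dat` are names made up for this example.

```r
set.seed(123)

# Base R sketch of randomized systematic sampling (not the package's code).
# pik: inclusion probabilities, each in [0, 1], summing to the sample size.
systematic_sample <- function(pik) {
  N <- length(pik)
  ord <- sample.int(N)          # visit the units in random order
  u <- runif(1)                 # random start in (0, 1)
  cum <- cumsum(pik[ord])       # upper end of each unit's interval
  # A unit is selected when its interval (cum[j-1], cum[j]] contains u + k
  # for some integer k
  hits <- floor(cum - u) - floor(c(0, cum[-N]) - u)
  selected <- logical(N)
  selected[ord] <- hits > 0
  selected
}

# Toy data standing in for the real data set
N <- 1e5
dat <- data.frame(target = sample(0:1, N, replace = TRUE), x = rnorm(N))
n <- 1e4

prb <- ifelse(dat$target == 1, 1.0, 0.05)
pik <- n / sum(prb) * prb       # scale so the probabilities sum to n
stopifnot(all(pik <= 1))        # scaled values must still be probabilities

keep <- systematic_sample(pik)
smpl_rows <- dat[keep, ]        # the selected rows of the data set
```

The data set enters only through the probability vector (one entry per row) and through the final subsetting step; changing the weights in `prb` changes the sampling constraints.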

Jan van der Laan
  • I think your solution doesn't work because you have less '0.05' than the actual sampling with no replacement. Check with [this code](https://gist.github.com/privefl/6a1ca203624cd5025136e6ee9be6d776) – F. Privé Sep 17 '17 at 10:24
  • @F.Privé The code you pasted makes very little sense. Jan van der Laan has given the correct answer, which is to use proper without replacement sampling - Either systematic or Pareto sampling – DaBookshah Sep 17 '17 at 22:58
  • Thanks @Jan van der Laan, I tried your code and it's super fast, but I don't understand how to apply it to my data set: I don't see where to pass in the data set or where to set the sampling constraints, for example 50% of target value = '1'. Any clarification would be great. – mql4beginner Sep 18 '17 at 07:49
1

The bottleneck is the sampling itself, as Jan van der Laan mentioned.

One solution when you need to sample without replacement (and the sample size is at most one fifth of the population size) is rejection sampling: draw twice as many values as you need with replacement, then keep only the first n unique values.

N <- 23e6
dat <- data.frame(
  target = sample(0:1, size = N, replace = TRUE),
  x = rnorm(N)
)      
prb <- ifelse(dat$target == 1, 1.0, 0.05)
n <- 2e6

Rcpp::sourceCpp('sample-fast.cpp')
sample_fast <- function(n, prb) {
  N <- length(prb)
  sample_more <- sample.int(N, size = 2 * n, prob = prb, replace = TRUE)
  get_first_unique(sample_more, N, n)
}

where 'sample-fast.cpp' contains

#include <Rcpp.h>
using namespace Rcpp;


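// [[Rcpp::export]]
IntegerVector get_first_unique(const IntegerVector& ind_sample, int N, int n) {

  LogicalVector is_chosen(N);   // is_chosen[j] is true once index j+1 has been kept
  IntegerVector ind_chosen(n);  // the first n distinct indices, in draw order

  int n_drawn = ind_sample.size();
  int i, k, ind;

  for (k = 0, i = 0; i < n; i++) {
    do {
      // Fail cleanly instead of reading past the end of ind_sample
      if (k >= n_drawn) stop("Not enough unique values in the sample.");
      ind = ind_sample[k++];
    } while (is_chosen[ind-1]);  // skip indices that were already kept
    is_chosen[ind-1] = true;
    ind_chosen[i] = ind;
  }

  return ind_chosen;
}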

Then you get:

system.time(ind <- sample_fast(n, prb))

in less than 1 second.
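If compiling C++ is not an option, the same rejection idea can be sketched in plain R, relying on unique() keeping values in order of first appearance. This is an assumed alternative, not part of the answer above; `sample_fast_r` is a hypothetical name, the toy sizes are made up, and it will likely be somewhat slower than the Rcpp version.

```r
set.seed(7)

# Plain R variant: oversample with replacement, keep the first n unique indices
sample_fast_r <- function(n, prb) {
  N <- length(prb)
  draws <- sample.int(N, size = 2 * n, replace = TRUE, prob = prb)
  first_unique <- unique(draws)   # unique() preserves order of first occurrence
  if (length(first_unique) < n) stop("Not enough unique values; draw more.")
  first_unique[seq_len(n)]
}

# Toy example (sizes chosen only for illustration)
N <- 1e5
target <- sample(0:1, N, replace = TRUE)
prb <- ifelse(target == 1, 1.0, 0.05)
ind <- sample_fast_r(1e4, prb)
```

The returned `ind` is a vector of row indices, so the sampled rows would be obtained with something like dat[ind, ].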

F. Privé
0

R is built to use only a single CPU core at a time. The easiest way to run your code multi-threaded is Microsoft R Open. I'm not quite sure whether it also speeds up sampling, but it's worth a shot. If not, multi-core packages such as parallel or multicore may do the trick for you. The catch is that multiple cores only help with some types of operations.

I can't say much about your code itself, as it doesn't come with a reproducible example.

j3ypi
  • With Microsoft R Open you wouldn't need to change any code. I don't know if sampling is supported for multi threading, though. – j3ypi Sep 17 '17 at 09:12
  • From what I know of MRO, it just uses an alternative linear algebra library (the MKL) so that only matrix operations are faster and parallelized, not any of the other computations. – F. Privé Sep 17 '17 at 10:15