
I would like to sample 50,000 values from the normal distribution with mean = 0 and sd = 1, but I want to limit the values to [-3, 3]. I've written code to do this, but I'm not sure it is the most efficient approach. I was hoping to get some suggestions.

lower <- -3
upper <- 3
x_norm <- rnorm(75000, 0, 1)
x_norm <- x_norm[x_norm >= lower & x_norm <= upper]
repeat {
    x_norm <- c(x_norm, rnorm(10000, 0, 1))
    x_norm <- x_norm[x_norm >= lower & x_norm <= upper]
    if (length(x_norm) >= 50000) break
}
x_norm <- x_norm[1:50000]
Ben Bolker
user1375871

3 Answers


Something like your code will certainly work, but you're greatly overestimating how many values you need. Since you're sampling from a known distribution with a fairly large number of draws, you can compute how many values are expected to fall outside [-3, 3]:

(1-pnorm(3))*2 * 50000
[1] 134.9898

So, given that only about 135 values in a draw of 50,000 are expected to fall out of range, it's easy to draw slightly more than that, but still not an inordinately larger number, and trim. Just take the first 50,000 of the 50,500 draws that fall within (-3, 3).

x <- rnorm(50500)
x <- x[x < 3 & x > -3]
x <- x[1:50000]

I ran the first two lines 40,000 times and they returned a length greater than 50,000 every time. A small length check in a loop can guarantee it always does:

x <- numeric(0)
while (length(x) < 50000) {
    x <- rnorm(50500)
    x <- x[x < 3 & x > -3]
}
x <- x[1:50000]

For me this executes in about 6 ms nearly every time. It's a simple way to do it in R that runs fast, is easy to read, and requires no add-on packages.
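To quantify how safe that 50,500 cushion is, one can compute the exact probability of coming up short with pbinom. This is a quick supplementary check, not part of the answer above:

```r
# Each N(0,1) draw lands in [-3, 3] with this probability:
p_in <- pnorm(3) - pnorm(-3)               # about 0.9973

# Probability that fewer than 50,000 of 50,500 draws survive the filter:
p_fail <- pbinom(50000 - 1, size = 50500, prob = p_in)
p_fail                                     # vanishingly small
```

The expected number of in-range values is about 50,364 with a standard deviation near 12, so falling below 50,000 is a roughly 30-sigma event, which is why the empirical 40,000-run check never saw a failure.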

John

If you really care about efficiency, this short piece of Rcpp code will be hard to beat. Store the following in a file, say /tmp/rnormClamp.cpp:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rnormClamp(int N, int mi, int ma) {
    NumericVector X = rnorm(N, 0, 1);
    return clamp(mi, X, ma);
}

/*** R
  system.time(X <- rnormClamp(50000, -3, 3))
  summary(X)
*/

Use sourceCpp() (from Rcpp as well) to build and run it. The actual draw and clamping takes about 4 milliseconds on my computer:

R> sourceCpp("/tmp/rnormClamp.cpp")

R>   system.time(X <- rnormClamp(50000, -3, 3))
   user  system elapsed 
  0.004   0.000   0.004 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.67300 -0.00528  0.00122  0.68500  3.00000 
R> 

The clamp() sugar function was featured in this previous SO answer by Romain, which also notes that you want version 0.10.2 of Rcpp.
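Note that clamping and rejection behave differently: as Ben's comment below points out, clamp() maps out-of-range draws onto the bounds instead of discarding them, so a small spike of mass sits exactly at -3 and 3. A quick base-R illustration, using pmin/pmax as a stand-in for Rcpp's clamp():

```r
set.seed(123)                        # arbitrary seed, for reproducibility
x <- rnorm(50000)
x_clamped <- pmin(pmax(x, -3), 3)    # same effect as clamp(-3, x, 3)

# The clamped values pile up at the endpoints instead of being rejected;
# per the earlier calculation, roughly (1 - pnorm(3)) * 2 * 50000 of them:
n_at_bounds <- sum(x_clamped == -3 | x_clamped == 3)
```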

Edit: Per Ben's comment, I seem to have misunderstood. Here is a mix of C++ and R:

// [[Rcpp::export]]
List rnormSelect(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X < mi) | (X > ma);
  return List::create(X, ind);
}

which one can append to the earlier file. Then:

R>   system.time({ Z <- rnormSelect(50000, -3, 3); 
+                  X <- Z[[1]][ ! Z[[2]] ]; X <- X[1:50000]})
   user  system elapsed 
  0.008   0.000   0.009 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.00000 -0.68200 -0.00066 -0.00276  0.66800  3.00000 
R> 

I'll revisit the logical indexing and the row subsetting, which I'll have to look up. Maybe tomorrow. But 9 milliseconds is still not too bad :)

Edit 2: It looks like we really don't have logical indexing in Rcpp sugar, so we'll have to add it. This version does the selection 'by hand', but it is not much faster than indexing from R:

// [[Rcpp::export]]
NumericVector rnormSelect2(int N, int mi, int ma) {
  RNGScope scope;
  int N2 = N * 1.25;
  NumericVector X = rnorm(N2, 0, 1);
  LogicalVector ind = (X >= mi) & (X <= ma);
  NumericVector Y(N);
  int k = 0;
  for (int i = 0; i < N2 && k < N; i++) {
    if (ind[i]) Y(k++) = X(i);
  }
  return Y;
}

And the output:

R>   system.time(X <- rnormSelect2(50000, -3, 3)) 
   user  system elapsed 
  0.004   0.000   0.007 

R>   summary(X)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.99000 -0.66900 -0.00258  0.00223  0.66700  2.99000 

R>   length(X)
[1] 50000
R> 
Dirk Eddelbuettel
  • I think the OP doesn't want to clamp, but to draw a larger-than-needed sample and discard out-of-range values ... at least, that's what their example does. – Ben Bolker Dec 25 '12 at 21:53
  • Oh, I see; I guess I missed that in the rush before the guests came for dinner :). Well, whatever: it works the same way, since Rcpp sugar makes it *trivially* easy to evaluate such booleans. So do as he does and compute N*(1 + fudge) values and just index out the ones that do not "fit". I presume there are analytical results for such truncated normal distributions as well... – Dirk Eddelbuettel Dec 25 '12 at 21:56

John and Dirk gave nice examples of rejection sampling, which should be fine for the given question. But to offer another approach: when you have the cumulative distribution function and its inverse (or reasonable approximations thereof), you can simply generate data from a uniform distribution and transform it:

x <- qnorm( runif(50000, pnorm(-3), pnorm(3)) )
range(x)
hist(x)

For the given question I don't expect this to be much better (if any) than the rejection-sampling methods, but if you wanted to generate values between 2 and 3 from a truncated N(0,1), this method would be far more efficient. It does depend on having the cumulative distribution function and its inverse (pnorm and qnorm in this case), so it is not as simple as rejection sampling for distributions where those are not readily available.
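To make that efficiency claim concrete for the [2, 3] case: rejection from N(0,1) would keep only about 2% of draws there, while the inverse-CDF transform uses every uniform draw. A small sketch with the same qnorm/runif/pnorm approach as above:

```r
lo <- pnorm(2)
hi <- pnorm(3)
# Acceptance rate a rejection sampler would face on [2, 3]:
hi - lo                          # about 0.021, i.e. ~47 raw draws per kept value

# Inverse-CDF method: every one of the 50,000 uniform draws is usable.
x <- qnorm(runif(50000, lo, hi))
range(x)                         # all values lie inside [2, 3]
```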

Greg Snow
  • I guess I was just thinking through the way he was doing it more thoroughly, rather than thinking of the best way to do it. – John Dec 27 '12 at 02:58
  • @John, but the best way to do it depends on the problem (and others with similar but different problems could find our answers), in some cases your answer will be better, in some mine. Sometimes another answer may be best. Searchers can see both our answers and decide for themselves which is better. – Greg Snow Dec 27 '12 at 17:08
  • This answer deserves more upvotes – John Smith Apr 12 '23 at 15:48