Fastest way to compute the cdf of the Normal distribution over vectors - R::pnorm vs erfc vs?

Question

I hope my reworded question now fits the criteria of Stackoverflow. Please consider the example below. I am writing a Log-Likelihood function in which computing the cdf over vectors is the most time consuming part. Example 1 uses the R::pnorm, Example 2 approximates the normal cdf with erfc. As you can see the results are sufficiently similar, the ercf version is a bit faster.

In practice (within an MLE) however it turns out that the ercf is not as precise, which lets the algorithm run into inf areas unless one sets the constraints accurately. My questions:

1) Am I missing something? Is it necessary to implement some error handling (for the erfc)?

2) Do you have any other suggestions to speed up the code, or alternatives? Does it pay off to look into parallelizing the for-loop?

require(Rcpp)
require(RcppArmadillo)
require(microbenchmark)

#Example 1 : standard R::pnorm
src1 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const     arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
   res(i) = R::pnorm(x(i),mu(i),sigma(i),lt,lg);
}
return wrap(res);
}
'

#Example 2: approximation with ercf
src2 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const    arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
res(i) = 0.5 * erfc(-(x(i) - mu(i))/sigma(i) * M_SQRT1_2);
}
if (lt==0 & lg==0) {
   return wrap(1 - res);
}
if (lt==1 & lg==0) {
   return wrap(res);
}
if (lt==0 & lg==1) {
   return wrap(log(1 - res));
}
if (lt==1 & lg==1) {
   return wrap(log(res));
}
}
'

#some random numbers
xex  = rnorm(100,5,4)
muex = rnorm(100,3,1)
siex = rnorm(100,0.8,0.3)

#compile c++ functions 
func1 = cppFunction(depends = "RcppArmadillo",code=src1) #R::pnorm
func2 = cppFunction(depends = "RcppArmadillo",code=src2) #ercf

#run with exemplaric data
res1 = func1(xex,muex,siex,1,0)
res2 = func2(xex,muex,siex,1,0)

# sum of squared errors
sum((res1 - res2)^2,na.rm=T)
# 6.474419e-32 ... very small

#benchmarking
 microbenchmark(func1(xex,muex,siex,1,0),func2(xex,muex,siex,1,0),times=10000)
#Unit: microseconds
#expr    min      lq     mean median     uq     max neval
#func1(xex, muex, siex, 1, 0) 11.225 11.9725 13.72518 12.460 13.617 103.654 10000
#func2(xex, muex, siex, 1, 0)  8.360  9.1410 10.62114  9.669 10.769 205.784 10000
#my machine: Ubuntu 14.04 LTS, i7 2640M 2.8 Ghz x 4, 8GB memory, RRO 3.2.0 based on version R 3.2.0

Please provide representative examples of `a` and `ind` and the function you are trying to apply. — nrussell, May 19 '15 at 23:11
Did you read [this Rcpp Gallery post on subsetting](http://gallery.rcpp.org/articles/armadillo-subsetting/) ? — Dirk Eddelbuettel, May 20 '15 at 02:25
@DirkEddelbuettel I of course read these examples and I hope that my question makes clear that I am aware of the find(x==1) syntax. I wanted to subset and iterate over in one step without saving the object in between, because I was under the impression that this speeds up my code. I now found now a way to circumvent the problem. I will edit the question. — chameau13, May 20 '15 at 14:16
If you read the post, why do you apply the Armadillo functions on Rcpp types? Also your example is neither complete nor reproducible -- whereas our documentation at the Rcpp Gallery is both. I would start there and modify accordingly. — Dirk Eddelbuettel, May 20 '15 at 14:36
@DirkEddelbuettel I now posted my whole code, maybe this makes things clearer. I am converting everything to arma::mat or arma::vec at the beginning and the function to apply pnorm over the vectors is written to accept arma objects. I am really sorry if I am committing basic mistakes, but I started looking into C++ three days ago. — chameau13, May 20 '15 at 14:47
We use the term _minimally reproducible example_ a lot. Your codedump is neither minimal nor (for lack of input data) reproducible. — Dirk Eddelbuettel, May 20 '15 at 14:53
@DirkEddelbuettel I think there are questions questions on Stackoverflow with much larger codedumps than these 51 lines and I wanted to make the context clear, I mean I am referencing the relevant parts of the code in the text. I will try however to supply the data somehow. My original question however boils down to: Is it possible in Armadillo to create sequential subsets (subsets of subsets of etc..) in one step? — chameau13, May 20 '15 at 15:00
@DirkEddelbuettel I now completely rewrote the question. Unfortunately I wasn't find ad hoc a way to pull the dropbox link into R programmatically in the code. — chameau13, May 20 '15 at 16:34
So I completely rewrote my question and now, without any comment, it is put on hold? I guess the people who voted to delete it before didn't even review the new version. — chameau13, May 20 '15 at 20:57
I tested the code again as it is presented in the question, it runs without any problem. — chameau13, May 20 '15 at 21:03
@DirkEddelbuettel I again completely rewrote the question. Please consider to vote for reopening. — chameau13, May 22 '15 at 18:14
Looks much better but _3) Could the vectorized Rcpp::pnorm be an alternative? Is it as accurate?_ makes no sense. Where do you think the vectorized `pnorm()` comes from? I simply loops for you, otherwise it is the same function and code ... — Dirk Eddelbuettel, May 22 '15 at 18:20
Multiple comments lecturing me how to ask my question, but not a single comment on the question itself. Sad. — chameau13, May 24 '15 at 03:29

score 6 · Accepted Answer · answered May 26 '15 at 09:39

6

1) Well, you really should use R's pnorm() as your 0-th example. You don't, you use the Rcpp interface to it. R's pnorm() is already nicely vectorized R-internally (i.e. on C level) so may well be comparative or even faster than Rcpp. Also it does have the advantage to cover cases of NA, NaN, Inf, etc..

2) If you are talking about MLE, and you are concerned about speed and accuracy, you almost surely should rather work with the logarithms, and maybe not withpnorm() but rather dnorm() ?

answered May 26 '15 at 09:39

Martin Mächler

4,619
27
27

1

Thanks for your reply: 1) By the comment of Dirk Eddelbuettel above I assume that the vectorized version of pnorm and the code I am actually writing out are equivalent. 2) I am using logs of course. Maybe I was sloppy with my language and should have written that this is going to be part of a log-likelihood function. But can you outline how I could work with dnorm and what adavantages this would have? The MLE is a hierarchical ordered probit model and until I read your comment I always was under the impression that I need the c.d.f. thus pnorm for it.... – chameau13 May 26 '15 at 12:12
1

If indeed you need a probit model you do need `pnorm()`. OTOH, many statisticians do prefer "the default" logit to probit ... one reason being that the likelihood becomes much easier, including its gradient, etc. So, for your problem, you could compute the logit solution first (which has a "cheap" log likelihood) and then use its solution as "initial starting point" for the probit one which is more expensive. I would recommend remaining with a pnorm() based computation. The speed loss is less than 50% so why should you care? Reliability and portability is more important I'd say. – Martin Mächler May 29 '15 at 18:55

Fastest way to compute the cdf of the Normal distribution over vectors - R::pnorm vs erfc vs?

1 Answers1