2
n<-100000   
aa<-rnorm(n)
bb<-rnorm(n)
system.time(lapply(aa, function(z){mean(bb<pnorm(z))}))

It takes too long to run this small code. Simply put, I have two vectors aa and bb. For each element of aa, say aa[i], I want the proportion of bb < aa[i]

I found this article and tried to use it to speed up. But it does not work. Speed comparison of sapply with a composite function

Any help will be appreciated!

Community
  • 1
  • 1
NJmonkey
  • 21
  • 2

3 Answers3

7

You may be able to use the findInterval function:

n <- 25000
aa <- rnorm(n)
bb <- rnorm(n)
system.time(q1 <- lapply(aa, function(z){mean(bb<pnorm(z))}))
#   user  system elapsed
# 20.057   2.544  22.807
system.time(q2 <- findInterval(pnorm(aa), sort(bb))/n)
#   user  system elapsed
#  0.020   0.000   0.021
all.equal(as.vector(q1, "numeric"), q2)
# [1] TRUE

Note that findInterval returns indices, so I've divided the result by n. If you can sort pnorm(aa) before giving it to findInterval, it will be even faster.

Andy
  • 4,789
  • 23
  • 20
  • 1
    Fantastic! I had never encountered the findInterval function before. – Ian Fellows May 19 '11 at 03:07
  • 3
    @Ian What reminds me about http://unknownr.r-forge.r-project.org/. From author description: "Do you know how many functions there are in base R? How many of them do you know you don't know? Run `unk()` to discover your unknown unknowns. It's fast and it's fun!" – Marek May 19 '11 at 08:22
1

I'm not meaning to be facetious but these are the types of problems that R is designed to solve without having to do every single calculation - ie, use statistics!

Assuming that the distributions are normal...

aa.new <- sample(aa, 1000)
bb.new <- sample(bb, 1000)

x <- lapply(aa.new, function(z){mean(bb.new<pnorm(z))})
x <- unlist(x)

mean(x)

You can be 99% certain that the proportion of bb < aa[i] falls between +/- 4% of mean(x).

For simple random sampling, 99% margin of error = 1.29/sqrt(n)

Brandon Bertelsen
  • 43,807
  • 34
  • 160
  • 255
1

If you only want the proportion ' < aa[i]' then you should just determine the number of bb less than than each value of aa and then divide by length:

bbs <- sort(bb)
zz <- findInterval(aa, bbs)
zz <- zz/length(aa)

It does what you say you want, while your code I fear does not.

IRTFM
  • 258,963
  • 21
  • 364
  • 487