
Yesterday I asked a very simple vectorization question and got some great answers. Today the question is a bit more complex and I'm wondering if R has a function to speed up the runtime of this loop through vectorization.

The loop is

for(j in 1:N) {
    A[j,1] = B[max(which(C[j]>=D))];
}

I tried

A[,1] = B[max(which(C>=D))];

and this dropped the runtime considerably ... but the answer was wrong. Is there a "correct" way to do this in R?

EDIT 1:

Thanks for the questions regarding data. I will give sizes of the arrays here:

We are looping over 1:N

 A is N x 1
 B is length M
 C is length N
 D is length M

If it matters in terms of speed, in this example, N = 844, M = 2500.

EDIT 2:

And here are some values for a smaller simulated dataset:

B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)

The result should be:

 > A
           [,1]
 [1,] 1.0000000
 [2,] 0.9565217
 [3,] 0.9565217
 [4,] 0.9565217
 [5,] 0.7173913
 [6,] 0.7173913
 [7,] 0.7173913
 [8,] 0.6277174
 [9,] 0.5230978
[10,] 0.3923234
– drjrm3

  • Adding some example data would really be helpful. – coffeinjunky May 20 '16 at 10:53
  • I second the comment about an example... it's not clear what you intend to do. Presumably `C[j]` and `D` are scalars, in which case a single comparison is made. So your `which` (and therefore the `max`) will always be 1... – Gabe May 20 '16 at 11:50
  • Trying to read between the lines, I have the inkling that what you want is `A[,1] = B[pmax(C,D)]`. Since I can't edit my previous comment anymore, I'll clarify that it'll always be 1 if TRUE, otherwise it'll be the empty set. – Gabe May 20 '16 at 11:58
  • So, what do you want `C[j]>D` to indicate? It seems that these are two vectors of different length. Do you mean `C[j]>D[j]`? In any case, Gabe is right that your construct with `max` and `which` is troublesome. – coffeinjunky May 20 '16 at 12:26
  • To follow up on coffeinjunky's comment, when you compare mismatched vectors, R will recycle values from the shorter one to get them to match. Chances are that even _if_ your loop is giving the answer you want, it might just be by coincidence, since I think it's unlikely you actually _want_ the values to be recycled that way (meaning it may not be correct with a different set of values). – Gabe May 20 '16 at 13:18

3 Answers


You can use outer for this.

Your code:

A1 <- matrix(NA_real_, ncol = 1, nrow = length(C))
for(j in seq_along(C)) {
  A1[j,1] = B[max(which(C[j]>=D))];
}

Test whether each element of C is greater than or equal to each element of D with outer:

test <- outer(C, D, FUN = ">=")
#      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [8,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [9,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#[10,] TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Note that this approach builds an N-by-M logical matrix, which can use a lot of memory for long vectors.
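For a rough sense of scale, with the sizes from the question (N = 844, M = 2500) the logical matrix alone is about 8 MB; a quick back-of-the-envelope check (mine, not part of the original answer):

## an N-by-M logical matrix stores 4 bytes per element:
## 844 * 2500 * 4 bytes is roughly 8.4 MB
format(object.size(matrix(TRUE, nrow = 844, ncol = 2500)), units = "Mb")
# [1] "8 Mb"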

Then find the last TRUE value in each row:

ind <- max.col(test, ties.method = "last") * (rowSums(test) > 0)

rowSums(test) > 0 tests whether a row contains any TRUE values; multiplying by it sets the corresponding element of ind to 0 for rows that contain none. It's undefined what you'd want to happen in this case. (A 0 index is dropped during subsetting, so those rows would simply be missing from the result. Possibly, you'd want NA instead in your final result?)
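If NA is the preferred behaviour for such rows, one possible tweak (a sketch of mine; the names ind_na and A2_na are made up):

ind_na <- max.col(test, ties.method = "last")
ind_na[rowSums(test) == 0] <- NA_integer_  # rows with no TRUE get an NA index
A2_na <- as.matrix(B[ind_na], ncol = 1)    # an NA index yields NA in the result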

Now subset:

A2 <- as.matrix(B[ind], ncol = 1)
#           [,1]
# [1,] 1.0000000
# [2,] 0.9565217
# [3,] 0.9565217
# [4,] 0.9565217
# [5,] 0.7173913
# [6,] 0.7173913
# [7,] 0.7173913
# [8,] 0.6277174
# [9,] 0.5230978
#[10,] 0.3923234

Are the results identical?

identical(A2, A1)
#[1] TRUE

The data (please use dput next time to provide example data):

B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074,
2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
– Roland

After seeing @Roland's answer, I think I understand better what you are asking. To double check: you want to compare each value of C (individually) against all values of D, and get the largest index of D (let's call it k) that holds a value no greater than C[j] (i.e., D[k] <= C[j]). You then want to use it to assign the corresponding value of B to A, thus A[j] = B[k]. Is this correct?

I don't have an answer regarding how to vectorize what you want to do, but do have some suggestions on how to speed it up. Before that, let me ask whether it's actually worth going through the effort. For the larger example you mentioned (N~1000, M~2500), your loop still runs in well under a second on my laptop. Unless this calculation is done many times over inside another loop, it seems like unnecessary optimization...

Also, like @Roland pointed out, it's not clear what should happen if there is a value in C that's smaller than all values in D. These functions (including your original loop) will not work if that happens and would need some slight tweaking.
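For reference, one such tweak: a guarded version of the loop that returns NA for those rows (a sketch of mine; the name f_loop_safe is made up):

f_loop_safe <- function(B, C, D){
  N <- length(C)
  A <- matrix(NA_real_, ncol = 1, nrow = N)
  for (j in 1:N) {
    k <- which(C[j] >= D)
    if (length(k) > 0) A[j, 1] <- B[max(k)]  # leave NA when no element of D qualifies
  }
  A
}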

Anyway, these are my suggestions:

First, let me wrap your loop into a function for convenience.

f_loop <- function(B, C, D){
  N <- length(C)
  A <- matrix(0, ncol=1, nrow=N)
  for(j in 1:N) {
    A[j,1] = B[max(which(C[j]>=D))]
  }
  return(A)
}

If you want it to look a bit more "R-like" you can replace the loop with one of the *apply functions. In this case, it also runs slightly faster than the loop.

vapply(C, function(x) B[max(which(x>=D))], 0)

## Wrapped into a function for easier reference
f_vapply <- function(B, C, D){
  vapply(C, function(x) B[max(which(x>=D))], 0)
}

My other suggestion is uglier (and not really "R-like"), but can help speed things up a lot (if that's the end goal here). I used the inline package to create a compiled version of your loop (note that depending on your OS and R setup, you may need to download additional tools or packages to be able to compile code).

## Translate loop into Fortran
loopcode <-
"  integer i, j, k
   do i = 1, n
      k = 0
      do j = 1, m
         if (C(i) >= D(j)) k = j
      end do
      A(i) = B(k)
   end do
"
## Compile into function
library(inline)
loopfun <- cfunction(
  sig = signature(A = "numeric", B = "numeric", C = "numeric", D = "numeric",
                  n = "integer", m = "integer"),
  dim = c("(n)", "(m)", "(n)", "(m)", "", ""),
  loopcode,
  language = "F95"
)

## Wrap into function for easier reference
f_compiled <- function(B, C, D){
  A <- C
  n <- length(A)
  m <- length(B)
  out <- loopfun(A, B, C, D, n, m)
  return(as.matrix(out$A, ncol=1))
}

Let's check that the results all match:

cbind(A, f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
##            [,1]      [,2]      [,3]      [,4]
##  [1,] 1.0000000 1.0000000 1.0000000 1.0000000
##  [2,] 0.9565217 0.9565217 0.9565217 0.9565217
##  [3,] 0.9565217 0.9565217 0.9565217 0.9565217
##  [4,] 0.9565217 0.9565217 0.9565217 0.9565217
##  [5,] 0.7173913 0.7173913 0.7173913 0.7173913
##  [6,] 0.7173913 0.7173913 0.7173913 0.7173913
##  [7,] 0.7173913 0.7173913 0.7173913 0.7173913
##  [8,] 0.6277174 0.6277174 0.6277174 0.6277174
##  [9,] 0.5230978 0.5230978 0.5230978 0.5230978
## [10,] 0.3923234 0.3923234 0.3923234 0.3923234

And check the speed:

library(microbenchmark)
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: microseconds
##                 expr    min      lq     mean  median      uq    max neval cld
##      f_loop(B, C, D) 52.804 54.8075 57.34588 56.5420 58.4615 83.843   100   c
##    f_vapply(B, C, D) 38.677 41.5055 43.21231 42.8825 44.1525 65.355   100  b 
##  f_compiled(B, C, D) 17.095 18.2775 20.55372 20.1770 21.4710 66.407   100 a  

We can also try it with vectors of similar size to the larger ones you mentioned (note the change in units for the results):

## Make the vector larger for benchmark
B <- rep(B, 100)  # M = 2500
C <- rep(C, 100)  # N = 1000
D <- rep(D, 100)  # M = 2500

microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: milliseconds
##                 expr       min       lq     mean    median        uq      max neval cld
##      f_loop(B, C, D) 24.380069 24.85061 25.99855 25.839282 25.952433 62.75721   100   b
##    f_vapply(B, C, D) 23.543749 24.18427 25.34881 25.015859 25.179924 62.60746   100   b
##  f_compiled(B, C, D)  1.976611  2.01403  2.06750  2.032864  2.057594  3.13658   100  a

EDIT:

I realized that if you always want the largest index of D for which C[j] >= D holds, it makes much more sense to loop through D starting from the end of the array and to exit as soon as the first match is found, instead of scanning the full array every time.

This is a small tweak to the Fortran code I wrote above that takes advantage of that.

loopcode <-
"  integer i, j, k
   do j = 1, n
      k = 0
      do i = m, 1, -1
         if (C(j) >= D(i)) then
            k = i
            exit
         end if
      end do
      A(j) = B(k)
   end do
"

I won't include it in the benchmarks, because it'll be much more dependent on the actual data points. But it is obvious that worst case behavior is the same as the previous loop (e.g. if the index of interest occurs at the beginning, D is looped through in full) and the best case behavior almost completely eliminates looping through D (e.g. if the condition holds at the end of the array).
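For completeness, the same scan-from-the-right idea can be written in base R with Position(), which also stops at the first hit from the right (a sketch of mine; it mirrors the logic but, being an R-level loop, it won't approach the compiled speed):

f_position <- function(B, C, D){
  A <- vapply(C, function(x){
    ## Position(..., right = TRUE) scans D from the end, stopping at the first hit
    k <- Position(function(d) x >= d, D, right = TRUE, nomatch = NA_integer_)
    B[k]  # an NA index propagates NA when no element of D qualifies
  }, 0)
  as.matrix(A, ncol = 1)
}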

– Gabe
  • Interesting. For what it's worth, yes, this is worth optimizing. It will be called millions upon millions (maybe billions) of times. 95% of the runtime in the entire code is already written (manually) in fortran but this section is not. – drjrm3 May 20 '16 at 23:02
  • Got it, then yes it certainly seems worth it. Do you always want the largest index of `D` where `C[j]>=D` holds? If that's the case, I made a small tweak to the Fortran loop that can speed it up significantly. – Gabe May 22 '16 at 02:31

If you are eager to get the answer immediately, jump to Conclusion. I offer you a single line of R code with maximum efficiency. For details/ideas, read through the following.

Code re-shaping and problem re-definition

OP asks for a vectorization of the following loop:

for(j in 1:N) A[j, 1] <- B[max(which(C[j] >= D))]

The first thing I do is transform it into a tidier, two-stage version:

## stage 1: index computation (need vectorization)
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
## stage 2: shuffling (readily vectorized)
A[, 1] <- B[id]

Now we see that only stage 1 needs to be vectorized. This stage essentially does the following:

        D[1]  D[2]  D[3]  ...  D[M]
C[1]
C[2]
C[3]
 .
 .
C[N]

For each row j, find the cut off location k(j) in D, such that D[k(j) + 1], D[k(j) + 2], ..., D[M] > C[j].

Efficient algorithm based on sorting

There is actually an efficient algorithm to do this:

  1. sort C in ascending order, into CC (record ordering index iC, such that C[iC] == CC)
  2. sort D in ascending order, into DD (record ordering index iD, such that D[iD] == DD)

By sorting, we substantially reduce the work complexity.

If the data are unsorted, we have to scan all elements D[1], D[2], ..., D[M] in order to decide on k(j). That is an O(M) cost per row, thus O(MN) in total.

However, if the data are sorted, we only need to do the following:

j = 1: search `D[1], D[2], ..., D[k(1)]`, till `D[k(1) + 1] > C[1]`;
j = 2: search `D[k(1) + 1], D[k(1)+2], ..., D[k(2)]`, till `D[k(2) + 1] > C[2]`;
...

For each row, only a partial search is applied, and the overall complexity is only O(M): the D vector is touched once in total, rather than N times as in the trivial implementation. As a result, after sorting, the algorithm is roughly N times faster. For large M and N, this is a huge difference! As you said in another comment, this code will be called millions of times, so we definitely want the O(M) algorithm instead of the O(MN) one.

Also note that the memory cost of this approach is O(M + N): we only concatenate two vectors, rather than expanding them into an M-by-N matrix. This storage saving is noticeable, too.
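For concreteness, here are the two sorting steps above in code (a direct transcription of items 1 and 2, defining the names CC, DD, iC, iD used below):

iC <- order(C); CC <- C[iC]  # CC is C sorted ascending, with C[iC] == CC
iD <- order(D); DD <- D[iD]  # DD is D sorted ascending, with D[iD] == DD
N <- length(C); M <- length(D)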

In fact, we can take one step further, by converting this comparison problem into a matching problem, which is easier to vectorize in R.

## version 1:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD, decreasing = TRUE)  ## sort into descending order
id0 <- M + N - match(CC, CCDD) + 1
id <- id0 - 1:N

To understand why this works, consider an alternative representation:

## version 2:
CCDD <- c(CC, DD)  ## combine CC and DD
CCDD <- sort(CCDD)  ## sort into ascending order
id0 <- match(CC, CCDD)
id <- id0 - 1:N

Now the following diagram illustrates what CCDD vector looks like:

CCDD:         D[1]  D[2]  C[1]  D[3]  C[2]  C[3]  D[4]  D[5]  D[6]  C[4] .....
 id0:                       3           5     6                       10 .....
 id :                       2           3     3                        6 .....

So DD[id] gives D[2], D[3], D[3], D[6], ...: exactly the last element of D no greater than C[1], C[2], C[3], C[4], .... Therefore, id is just the index vector we want!

People may wonder why I suggest "version 1" rather than "version 2": when there are tied values in CCDD, "version 2" will give a wrong result, because match() takes the first element that matches and ignores later matches. So instead of matching from left to right (in ascending index), we have to match from right to left (in descending index).
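A one-line illustration of match()'s first-match behaviour (a toy example of mine):

match(2, c(1, 2, 2, 3))
# [1] 2  (the first occurrence wins; the duplicate at position 3 is never reported)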

Using OP's data

With this in mind, I start looking at OP's data. Now amazingly, OP's data are already sorted:

C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587, 0.58848,
       0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164, 1.22956, 1.23760,
       1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
M <- length(D); N <- length(C)

is.unsorted(C)
# [1] FALSE

is.unsorted(D)
# [1] FALSE

Furthermore, OP has already combined C and D:

all(C %in% D)
# [1] TRUE

It seems that OP and I have the same idea about efficiency. Presumably OP once had a shorter D vector, while the D vector he supplied is really the CCDD vector I mentioned above!

Now, in this situation, things are simple all the way: we just need a single line:

id <- M - match(C, rev(D)) + 1

Note the rev(): OP has sorted D in ascending order, so I need to reverse it. This single line may look very different from the "version 1" code, but nothing is wrong here. Remember, the D used here is really the CCDD in the "version 1" code, and the M here is really the M + N there. Also, there is no need to subtract 1:N from id, due to our different definition of D.

Checking result

Now, the trivial R-loop gives:

id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
id
# [1]  1  4  7  9 14 15 16 19 21 23

Well, our single-line, vectorized code gives:

id <- M - match(C, rev(D)) + 1
id
# [1]  1  4  7  9 14 15 16 19 21 23

Perfect match, hence we are doing the right thing.
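Incidentally, since D here is sorted, base R's findInterval() computes the same index vector directly: for each C[j] it returns the largest k with D[k] <= C[j], or 0 when no such element exists.

findInterval(C, D)  # requires D sorted non-decreasingly
# [1]  1  4  7  9 14 15 16 19 21 23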

Conclusion

So, Laurbert, this is the answer you want:

A[, 1] <- B[M - match(C, rev(D)) + 1]
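Applied to the example data (with M <- length(D) as above, and A allocated as an N-by-1 matrix as in the question):

A <- matrix(0, nrow = N, ncol = 1)
A[, 1] <- B[M - match(C, rev(D)) + 1]
A[, 1]
# [1] 1.0000000 0.9565217 0.9565217 0.9565217 0.7173913 0.7173913 0.7173913
# [8] 0.6277174 0.5230978 0.3923234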
– Zheyuan Li