If you are eager to get the answer immediately, jump to the Conclusion section, where I offer a single line of R code with maximum efficiency. For the details and ideas behind it, read through the following.
Code re-shaping and problem re-definition
When the OP asks for a vectorization of the following loop:
for(j in 1:N) A[j, 1] <- B[max(which(C[j] >= D))]
The first thing I do is to transform it into a nicer two-stage version:
## stage 1: index computation (need vectorization)
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
## stage 2: shuffling (readily vectorized)
A[, 1] <- B[id]
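As a quick sanity check, the two-stage rewrite reproduces the original loop. The data below are toy values of my own (not the OP's), made up purely to illustrate the equivalence:

```r
## toy data of my own (not the OP's), just to illustrate the rewrite
D <- c(0.2, 0.5, 0.8)
C <- c(0.3, 0.6, 0.9, 0.55)
B <- c(10, 20, 30)
N <- length(C)

## original loop
A1 <- matrix(0, N, 1)
for (j in 1:N) A1[j, 1] <- B[max(which(C[j] >= D))]

## two-stage version: index computation, then shuffling
id <- integer(N); for (j in 1:N) id[j] <- max(which(D <= C[j]))
A2 <- matrix(0, N, 1)
A2[, 1] <- B[id]

identical(A1, A2)   # TRUE
```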
Now we see that only stage 1 needs to be vectorized. This stage essentially does the following:
D[1] D[2] D[3] ... D[M]
C[1]
C[2]
C[3]
.
.
C[N]
For each row `j`, find the cut-off location `k(j)` in `D`, such that `D[k(j) + 1], D[k(j) + 2], ..., D[M] > C[j]`.
Efficient algorithm based on sorting
There is actually an efficient algorithm to do this:
- sort `C` in ascending order into `CC` (recording the ordering index `iC`, such that `C[iC] == CC`);
- sort `D` in ascending order into `DD` (recording the ordering index `iD`, such that `D[iD] == DD`).
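In R, this sorting with a recorded ordering index can be done with `order()`. A minimal sketch, on toy data of my own:

```r
C <- c(0.9, 0.1, 0.5)   # toy unsorted data, for illustration only
iC <- order(C)          # ordering index: c(2, 3, 1)
CC <- C[iC]             # sorted copy: c(0.1, 0.5, 0.9), with C[iC] == CC
```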
By sorting, we substantially reduce the work complexity.
If data are unsorted, we have to explicitly scan all elements `D[1], D[2], ..., D[M]` in order to decide on `k(j)`. So there is an O(M) cost for each row, hence an O(MN) cost in total.
However, if data are sorted, then we only need to do the following:
j = 1: search `D[1], D[2], ..., D[k(1)]`, till `D[k(1) + 1] > C[1]`;
j = 2: search `D[k(1) + 1], D[k(1)+2], ..., D[k(2)]`, till `D[k(2) + 1] > C[2]`;
...
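The partial search above can be sketched as an explicit two-pointer scan. This is my own illustrative implementation, not code from the OP; it assumes both `C` and `D` are sorted ascending and every `C[j]` is at least `min(D)`:

```r
sorted_cutoff <- function(C, D) {
  ## for sorted C and D, find k(j) = the largest index with D[k(j)] <= C[j]
  N <- length(C); M <- length(D)
  k <- integer(N)
  pos <- 0L                        # last D index known to satisfy D[pos] <= C[j]
  for (j in 1:N) {
    while (pos < M && D[pos + 1L] <= C[j]) pos <- pos + 1L
    k[j] <- pos                    # D is scanned once overall: O(M + N)
  }
  k
}

C <- c(0.3, 0.6, 0.9); D <- c(0.2, 0.5, 0.8)   # toy sorted data
sorted_cutoff(C, D)   # 1 2 3
```

The pointer `pos` never moves backwards, which is exactly why the total work on `D` is O(M) rather than O(MN).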
For each row, only a partial search is applied, and the overall complexity is only O(M), i.e., the `D` vector is touched only once, rather than N times as in the trivial implementation. As a result, after sorting, the algorithm is N times faster! For large M and N, this is a huge difference. As you said in another comment, this code will be called millions of times, so we definitely want the O(M) algorithm rather than the O(MN) one.
Also note that the memory cost of this approach is O(M + N), i.e., we only concatenate two vectors, rather than expanding them into an M-by-N matrix, so the storage saving is noticeable, too.
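As an aside, if I am not mistaken, base R's `findInterval()` already implements this one-pass sorted search in compiled code, so on sorted `D` it gives the same indices as the loop (toy data again, for illustration):

```r
D <- c(0.2, 0.5, 0.8)   # must be sorted ascending
C <- c(0.3, 0.6, 0.9)   # need not be sorted for findInterval
findInterval(C, D)      # 1 2 3, same as sapply(C, function(x) max(which(D <= x)))
```

One caveat: `findInterval()` returns 0 when `C[j] < min(D)`, whereas `max(which(D <= C[j]))` would return `-Inf` with a warning.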
In fact, we can go one step further and convert this comparison problem into a matching problem, which is easier to vectorize in R.
## version 1:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD, decreasing = TRUE) ## sort into descending order
id0 <- M + N - match(CC, CCDD) + 1
id <- id0 - 1:N
To understand why this works, consider an alternative representation:
## version 2:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD) ## sort into ascending order
id0 <- match(CC, CCDD)
id <- id0 - 1:N
Now the following diagram illustrates what the `CCDD` vector looks like:
CCDD: D[1] D[2] C[1] D[3] C[2] C[3] D[4] D[5] D[6] C[4] .....
id0: 3 5 6 10 .....
id : 2 3 3 6 .....
So `CCDD[id]` gives `D[2], D[3], D[3], D[6], ...`, exactly the last elements no greater than `C[1], C[2], C[3], C[4], ...`. Therefore, `id` is just the index we want!
One may wonder why I suggest "version 1" rather than "version 2". Because when there are tied values in `CCDD`, "version 2" gives the wrong result: `match()` takes the first element that matches, ignoring later matches. So instead of matching from left to right (in ascending order), we have to match from right to left (in descending order).
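A tiny example (toy data of my own) makes the tie issue concrete. With `C = c(0.5, 1.0)` and `D = c(0.2, 0.5, 0.8)`, the value 0.5 is tied between `C` and `D`:

```r
C <- c(0.5, 1.0); D <- c(0.2, 0.5, 0.8)   # toy sorted data with a tie at 0.5
M <- length(D); N <- length(C)

## reference loop result: c(2, 3)
ref <- sapply(C, function(x) max(which(D <= x)))

## "version 2": ascending match misses the tied D[2]
CCDD <- sort(c(C, D))
v2 <- match(C, CCDD) - 1:N                  # c(1, 3): wrong

## "version 1": descending match counts the tied D[2]
CCDD <- sort(c(C, D), decreasing = TRUE)
v1 <- (M + N - match(C, CCDD) + 1) - 1:N    # c(2, 3): correct
```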
Using OP's data
With this in mind, I started looking at the OP's data. Amazingly, the OP's data are already sorted:
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587, 0.58848,
0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164, 1.22956, 1.23760,
1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
M <- length(D); N <- length(C)
is.unsorted(C)
# FALSE
is.unsorted(D)
# FALSE
Furthermore, the OP has already combined `C` and `D`:
all(C %in% D)
# TRUE
It seems that the OP and I have the same idea about efficiency in mind. Presumably the OP once had a shorter `D` vector, and the `D` vector supplied here is really the `CCDD` vector I mentioned above!
Now, in this situation, things become really simple: we just need a single line:
id <- M - match(C, rev(D)) + 1
Note that I put `rev()` there because the OP has sorted `D` in ascending order, so I need to reverse it. This single line may look very different from the "version 1" code, but nothing is wrong here. Remember, the `D` used here is really the `CCDD` in the "version 1" code, and the `M` here is really the `M + N` there. Also, there is no need to subtract `1:N` from `id`, due to our different definition of `D`.
Checking result
Now, the trivial R loop gives:
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
id
# [1] 1 4 7 9 14 15 16 19 21 23
while our single-line, vectorized code gives:
id <- M - match(C, rev(D)) + 1
id
# [1] 1 4 7 9 14 15 16 19 21 23
A perfect match, hence we are doing the right thing.
Conclusion
So, Laurbert, this is the answer you want:
A[, 1] <- B[M - match(C, rev(D)) + 1]