R is written for vector/matrix operations. It allows but is not happy with for()
loops. Nested for()
loops take forever
I've read that pretty much all for()
loops can be turned into proper vector operations, but for the life of me, I can't figure out how to do it in this simple case:
I have two data tables, dt_a and dt_b of different lengths (dt_a
: 1408 rows & dt_b
: 2689 rows), with columns dt_a$x
, dt_b$y
, and dt_b$z
. I want to search for matches of any value in of column dt_a$x
in each value of dt_b$y
and if they match, set dt_b$z <- dt_a$x
. If there's no match, set it to "NOMATCH".
This is a programming 101 operation with for loops:
for (i in 1:2689) {
for (j in 1:1408) {
if (grepl(dt_a$x[j], dt_b$y[i], ignore.case=TRUE, perl=TRUE)) {
dt_b$z[i] <- dt_a$x[j];
break;
}
dt_a$z[i] <- "NOMATCH";
}
}
However, this operation takes more than 6 minutes to run, iterating through all the loops. I'll soon need to adapt it to a much larger data set, so the order of magnitude time increases will not be viable.
What's the correct way to do this nested for()
loop operation using proper R vector operations?
Thanks!
Update
The answer by @nickk vectorizes one of the loops making the nesting unecessary and reducing the execution by an order of magnitude. I've credited it as most useful answer because I was able to get it to work in my code. The answers provided by @deanmacgregor were very useful in helping me understand more about what is going on. I couldn't get them to run in my code, but that's probably my fault for not understanding something. The cross-join approach, in particular, is probably the best solution. I need more practice in order to make it work with my data, but I don't want to wait too long before resolving this question.
Additional thanks to @romantsegelskyi for teaching me proper question formatting, and to @pierrelafortune and @brodieG for teaching me the importance and content of reproducible questions. ^_^
I've credited you all in my source code which will (someday) be released as open source.