Let's compare three approaches. The first is the for loop in your question. The second is the one from Ronak's answer, using ifelse(), which makes the operation faster. However, the %in% operator itself is somewhat slow, so if performance is really a concern you can get an even faster solution by indexing with names.
For example, using the words dataset from the stringr package:
library(stringr)
DF1 <- data.frame(column1 = sample(words, 700))
DF2 <- data.frame(column1 = sample(words, 700))
We can compare these methods:
bench::mark(
  for_loop = {
    res1 <- character(nrow(DF1))
    for (i in seq_len(nrow(DF1))) {
      if (DF1$column1[i] %in% DF2$column1) {
        res1[i] <- "YES"
      }
    }
    res1
  },
  ifelse = {
    res2 <- ifelse(DF1$column1 %in% DF2$column1, "YES", "")
  },
  by_names = {
    # start from a vector of "" named by DF1's values, then overwrite the matches
    res3 <- setNames(rep("", nrow(DF1)), DF1$column1)
    res3[intersect(DF1$column1, DF2$column1)] <- "YES"
    res3
  },
  check = FALSE
)
# A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 for_loop     7.94ms   8.49ms      112.     3.8MB     8.98    50     4
# 2 ifelse      200.6us  214.8us     4494.    60.7KB     4.09  2197     2
# 3 by_names     73.4us   78.2us    12342.    73.9KB    17.5   5628     8
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
As you can see, the ifelse method is about 40x faster than the for loop (median 8.49ms vs 214.8us), and indexing by name is nearly 3x faster again than ifelse (78.2us).
If the ifelse method is fast enough, you should use it, as it is easier to read. If your dataset is very large, though, selecting by name can add some welcome performance.
NB: the three solutions give the same values, but the result of the third method carries names, hence the check = FALSE argument.
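To make the point about names concrete, here is a minimal sketch on toy data (the vectors x and y are made-up values, not from the question). It assumes the values being looked up are unique, as they are with sample(words, 700). With duplicated values, assignment by name only touches the first element carrying that name, so the by-name approach would miss the repeats.

```r
# Toy data (hypothetical values, unique within x)
x <- c("apple", "bird", "cat", "dog")
y <- c("cat", "apple", "emu")

# ifelse() approach: vectorised membership test
res_ifelse <- ifelse(x %in% y, "YES", "")

# by-name approach: named vector of "", then overwrite the matching names
res_names <- setNames(rep("", length(x)), x)
res_names[intersect(x, y)] <- "YES"

# The values agree once the names are dropped
same_values <- identical(res_ifelse, unname(res_names))
same_values
# [1] TRUE

# Caveat: with duplicated names, only the first match is overwritten
dup <- setNames(rep("", 2), c("cat", "cat"))
dup["cat"] <- "YES"
unname(dup)
# [1] "YES" ""
```

If column1 can contain duplicates, the ifelse() version is the safer choice.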