1

I have the below code, where I am trying to go each and every row of certain column of a dataframe and check that value into another dataframe to insert a value in column4 of first dataframe

for (i in 1:length(DF$Date)){
  if (DF$column1[i] %in% DF_2$column_1){
    DF$column4[i] <- "YES"
  }
}

The code works fine, as I have millions of records it is taking huge time to perform the task.

It would be helpful if anyone has an efficient way to tackle this in short span

neilfws
  • 32,751
  • 5
  • 50
  • 63
Vin
  • 89
  • 5
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Dec 15 '20 at 04:19

3 Answers3

3

You don't need a for loop here since %in% is vectorised. A simple ifelse should work.

DF$column4 <- ifelse(DF$column1 %in% DF_2$column_1, 'YES', 'NO')

You could also do this in the following way :

DF$column4 <- c('NO', 'YES')[(DF$column1 %in% DF_2$column_1) + 1]

which might be faster on larger datasets.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
3

Let's compare 3 approaches. The first is the for loop in your question. A second is the one Ronak answered, using ifelse() which makes the operation faster. However, the operator %in% itself is somewhat slow, so if performance is really a concern you can get an even faster solution using indexing with names.

For example, using the words dataset in the package stringr:

library(stringr)

DF1 <- data.frame(column1 = sample(words, 700))
DF2 <- data.frame(column1 = sample(words, 700))

We can compare these methods:


bench::mark(for_loop={
  res1 <- character(nrow(DF1))
  for (i in seq_len(nrow(DF1))){
    if (DF1$column1[i] %in% DF2$column1){
      res1[i] <- "YES"
    }
  }
  res1
},
ifelse = {
  res2 <- ifelse(DF1$column1 %in% DF2$column1, "YES", "")
},
by_names = {
  res3 <- setNames(rep("", nrow(DF1)),
                   DF1$column1)
  res3[intersect(DF1$column1, DF2$column1)] <- "Yes"
},check = FALSE)

# A tibble: 3 x 13
#     expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#     <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#   1 for_loop     7.94ms   8.49ms      112.     3.8MB     8.98    50     4
#   2 ifelse      200.6us  214.8us     4494.    60.7KB     4.09  2197     2
#   3 by_names     73.4us   78.2us    12342.    73.9KB    17.5   5628     8
#  ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#    time <list>, gc <list>

As you can see, the ifelse method is 40x faster than the for loop, and indexing by name is 3x faster than with ifelse.

If the ifelse method is fast enough, you should use it as it is easier to read, but if your dataset is too big, selecting by name can add some welcome performance.

NB: the three solutions do give the same result, but the third method has names, hence the check=FALSE argument.

Alexlok
  • 2,999
  • 15
  • 20
1

You can increase the efficiency by using fastmatch:

library(fastmatch)
DF$column4 <- c("", "YES")[(DF$column1 %fin% DF_2$column1) + 1]

or:

DF$column4 <- ""
x[fmatch(DF_2$column1, DF$column1, 0)] <- "Yes"

Benchmark (using Data from @Alexlok):

library(stringr)
set.seed(42)
DF <- data.frame(column1 = sample(words, 700))
DF_2 <- data.frame(column1 = sample(words, 700))

microbenchmark::microbenchmark(control=list(order="block")
 , Vin = {x <- character(nrow(DF))
   for (i in 1:nrow(DF)) {
     if (DF$column1[i] %in% DF_2$column1) {
       x[i] <- "YES"
     }
   }
   x}
 , RonakShah1 = ifelse(DF$column1 %in% DF_2$column1, 'YES', 'NO')
 , RonakShah2 = c('NO', 'YES')[(DF$column1 %in% DF_2$column1) + 1]
 , Alexlok = {x <- setNames(rep("", nrow(DF)), DF$column1)
   x[intersect(DF$column1, DF_2$column1)] <- "Yes"
   x}
 , fmatch = {x <- character(nrow(DF))
   x[fmatch(DF_2$column1, DF$column1, 0)] <- "Yes"
   x}
 , fin = c('NO', 'YES')[(DF$column1 %fin% DF_2$column1) + 1]
 )
#Unit: microseconds
#       expr       min         lq        mean     median         uq       max
#        Vin 13782.353 20549.5995 21467.31764 21795.3050 22845.0755 36143.621
# RonakShah1   281.118   308.1385   322.57051   316.5560   340.8620   362.097
# RonakShah2    60.906    66.5565    68.17693    67.8925    69.4110    88.363
#    Alexlok   162.817   179.6205   184.22998   182.3755   187.8450   257.323
#     fmatch    27.949    29.0590    29.90183    29.5275    30.0465    49.612
#        fin    30.446    31.0340    31.86467    31.5135    31.9540    49.817
GKi
  • 37,245
  • 2
  • 26
  • 48