18

I wrote the following function, it works. However it is very slow when df1 has 1700 rows, and df2 has 70000 rows. Is there anyway to improve the efficiency?

rowcheck <- function(df1, df2){
         apply(df1, 1, function(x) any(apply(df2, 1, function(y) all(y==x))))
}

An example I wrote this function to apply to is: I want to check whether each row in df1 is contained as a row in df2:

df1=data.frame(a=c(1:3),b=c("a","b","c"))
df2=data.frame(a=c(1:6),b=rep(c("a","b","c"),2))

For each row of df1, I want to check if it is contained as a row in df2. I want to return of the function to be a logical vector of length nrow(df1).

Thank you for your help.

Bruce Chen
  • 183
  • 1
  • 1
  • 5
  • 1
    Perhaps you could try `all(y %in% x)`? I suggest you provide a simple example that demonstrates your problem. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Mar 26 '14 at 21:29
  • The **dplyr** package has a `setdiff` method for data.frame's that you might investigate. – joran Mar 26 '14 at 21:48
  • otherwise i would paste values line per line with u <- do.call(paste, df1) and v <- do.call(paste, df2) and then do u%in%v, but i don't check the computation time – droopy Mar 26 '14 at 21:54
  • @droopy u%in%v work great for one dimentional u, and a vector v where each element is also one dimentional. However, when u is a vector, say u=(x1,x2), and v is a matrix or data frame of m x 2 dimention, the %in% does not work correctly. Am I missing something with %in%? – Bruce Chen Mar 26 '14 at 22:07
  • yes because my solution is exactly the one that Richard copy and paste into a function – droopy Mar 26 '14 at 22:41

4 Answers4

22

One way is to paste the rows together, and compare them with %in%. The result is a logical vector the length of nrow(df1), as requested.

do.call(paste0, df1) %in% do.call(paste0, df2)
# [1] TRUE TRUE TRUE
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • Thanks @RichardScriven. So basically you are reducing the column dimension to 1 so you could use %in%. So I guess %in% only works for one dimentional variable check. – Bruce Chen Mar 26 '14 at 22:34
  • This is such a perfect and clean solution to my problem! Thank you! – Carrol Apr 06 '18 at 15:47
7

Try:

Filter(function(x) x > 0, which(duplicated(rbind(df2, df1))) - nrow(df2))

It will tell you which row numbers in df1 occur in df2. If you want an atomic vector of logicals like in Richard Scriven's answer, try

duplicated(rbind(df2, df1))[-seq_len(nrow(df2))]

It is also faster since it uses an internal C function duplicated (mine is rowcheck2)

> microbenchmark(rowcheck(df1, df2), rowcheck2(df1, df2))
 Unit: milliseconds
                expr      min       lq   median       uq       max neval
  rowcheck(df1, df2) 2.045210 2.169182 2.328296 3.539328 13.971517   100
  rowcheck2(df1, df2) 1.046207 1.112395 1.243390 1.727921  7.442499   100
Robert Krzyzanowski
  • 9,294
  • 28
  • 24
  • If I am not wrong, will this not fail if there are duplicate values in `df1`? The OP does not seems to make any indication that `df1` will not have duplicates. – Benjamin Christoffersen Jan 07 '21 at 07:24
1

Just wanted to give my two cents on this query. A plyr-based solution:

nrow(match_df(df2, df1))

.. Will check each row of df1 against df2 (with respect to all columns) and give you the number of rows of df1 contained in df2.

Mak
  • 73
  • 7
0

A solution based on merge is:

# simulate data
options(stringsAsFactors = FALSE)
set.seed(1)
n1 <- 400L
n2 <- 1000L
df1 <- data.frame(a = sample.int(20L, n1, TRUE) ,
                  b = sample(letters, n1, TRUE))
df2 <- data.frame(a = sample.int(20L, n2, TRUE),
                  b = sample(letters, n2, TRUE))
df2 <- df2[!duplicated(df2), ]

# the new function
row_check_new <- function(x, y){
  # are there columns in x that are not in y or vice versa?
  if(length(union(colnames(x), colnames(y))) > length(colnames(x)))
    return(logical(NROW(x)))
  dum <- transform(x, row_id_dummy = 1:NROW(x))
  dum$row_id_dummy %in% merge(dum, y)$row_id_dummy
}

# it yields the same
rowcheck <- function(df1, df2)
  apply(df1, 1, function(x) any(apply(df2, 1, function(y) all(y==x))))

all.equal(rowcheck(df1, df2), row_check_new(df1, df2))
#R> [1] TRUE

# but is much faster
bench::mark(old = rowcheck(df1, df2), new = row_check_new(df1, df2))
#R> # A tibble: 2 x 13
#R>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#R>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#R> 1 old        322.56ms 327.26ms      3.06    11.4MB    18.3      2    12
#R> 2 new          1.25ms   1.31ms    736.     222.8KB     6.00   368     3

This works with duplicates in df1. The solution by Rich Scriven is faster. There are some corner cases where the solution based on merge is preferable as Rich Scriven's solution will give an incorrect answer. For instance, consider the following example with integers

df1 <- data.frame(x1 = 11, x2 = 1)
df2 <- data.frame(x1 = 1, x2 = 11)

do.call(paste0, df1) %in% do.call(paste0, df2)
#R> [1] TRUE
rowcheck(df1, df2)
#R> [1] FALSE
row_check_new(df1, df2)
#R> [1] FALSE