A solution based on merge is:
# simulate data
options(stringsAsFactors = FALSE)
set.seed(1)
n1 <- 400L
n2 <- 1000L
df1 <- data.frame(a = sample.int(20L, n1, TRUE),
                  b = sample(letters, n1, TRUE))
df2 <- data.frame(a = sample.int(20L, n2, TRUE),
                  b = sample(letters, n2, TRUE))
df2 <- df2[!duplicated(df2), ]
# the new function
row_check_new <- function(x, y){
  # do x and y have different sets of columns? Then no row can match
  if(!setequal(colnames(x), colnames(y)))
    return(logical(NROW(x)))
  # tag the rows of x, then see which row ids survive an inner join with y
  dum <- transform(x, row_id_dummy = seq_len(NROW(x)))
  dum$row_id_dummy %in% merge(dum, y)$row_id_dummy
}
# it yields the same
rowcheck <- function(df1, df2)
  apply(df1, 1, function(x) any(apply(df2, 1, function(y) all(y == x))))
all.equal(rowcheck(df1, df2), row_check_new(df1, df2))
#R> [1] TRUE
# but is much faster
bench::mark(old = rowcheck(df1, df2), new = row_check_new(df1, df2))
#R> # A tibble: 2 x 13
#R> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
#R> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#R> 1 old 322.56ms 327.26ms 3.06 11.4MB 18.3 2 12
#R> 2 new 1.25ms 1.31ms 736. 222.8KB 6.00 368 3
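To see why the row id column is needed, here is a minimal sketch on two tiny, made-up data frames (x_small and y_small are hypothetical names, not part of the benchmark above). merge joins on all shared columns, so only the rows of x_small with a match in y_small survive, and their ids tell us which of the original rows matched:

```r
# hypothetical toy data; x_small contains a duplicated row (rows 1 and 3)
x_small <- data.frame(a = c(1L, 2L, 1L, 3L), b = c("u", "v", "u", "w"),
                      stringsAsFactors = FALSE)
y_small <- data.frame(a = c(1L, 3L), b = c("u", "z"),
                      stringsAsFactors = FALSE)

# tag every row of x_small with its position
dum <- transform(x_small, row_id_dummy = seq_len(NROW(x_small)))

# inner join on the shared columns a and b; unmatched rows are dropped
matched <- merge(dum, y_small)

# both copies of (1, "u") are flagged; (3, "w") is not, since only a matches
res <- dum$row_id_dummy %in% matched$row_id_dummy
res
#R> [1]  TRUE FALSE  TRUE FALSE
```

Because the result is indexed by the row id rather than by the merged output, it always has one entry per row of the first data frame, regardless of how many rows the join returns.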
This works with duplicates in df1. The solution by Rich Scriven is faster, but there are some corner cases where the solution based on merge is preferable because Rich Scriven's solution gives an incorrect answer. For instance, consider the following example with integer-valued columns:
df1 <- data.frame(x1 = 11, x2 = 1)
df2 <- data.frame(x1 = 1, x2 = 11)
do.call(paste0, df1) %in% do.call(paste0, df2)
#R> [1] TRUE
rowcheck(df1, df2)
#R> [1] FALSE
row_check_new(df1, df2)
#R> [1] FALSE
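The false positive arises because paste0 concatenates the columns without a separator, so both rows collapse to the same string. A minimal sketch of the collision follows. The separator variant at the end is my own assumption about one possible workaround, not something taken from Rich Scriven's answer, and it only helps as long as the separator cannot occur in the data:

```r
df1 <- data.frame(x1 = 11, x2 = 1)
df2 <- data.frame(x1 = 1, x2 = 11)

# both rows are pasted to the same key, hence the spurious match
key1 <- do.call(paste0, df1)
key2 <- do.call(paste0, df2)
key1
#R> [1] "111"
key2
#R> [1] "111"

# assumed workaround: paste with a separator that does not occur in the data
key1_sep <- do.call(paste, c(df1, sep = "\r"))
key2_sep <- do.call(paste, c(df2, sep = "\r"))
key1_sep %in% key2_sep
#R> [1] FALSE
```

The merge-based function avoids the issue altogether because it compares the columns one by one instead of building a single string key.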