Compare and identify the missing rows

Question

I would like to compare per row 2 df based on serial and day variables and to create a new column called compare to highlight the missing rows. How can this be done in R? I tried the inner_join function without success.

Sample structure df1 and df2

Desired output:

Sample data

df1<-structure(list(serial = c(1, 2, 3, 4, 5), day = c(1, 0, 1, 0, 
0)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L), spec = structure(list(cols = list(serial = structure(list(), class = c("collector_double", 
"collector")), day = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))



df2<-structure(list(serial = c(1, 2, 3, 4, 5, 5, 7), day = c(1, 0, 
1, 0, 0, 1, 1)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), spec = structure(list(cols = list(
    serial = structure(list(), class = c("collector_double", 
    "collector")), day = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))

Related post: https://stackoverflow.com/q/3171426/680068 – zx8754 Jun 30 '21 at 13:15 — zx8754, Jun 30 '21 at 13:15

score 4 · Answer 1 · answered Jun 30 '21 at 16:57

We can use tidyverse

library(dplyr)
df2 %>%
    mutate(compare = TRUE) %>% 
   left_join(df1 %>%
      mutate(compare1 = TRUE), by = c('serial', 'day')) %>%
      transmute(serial, day, compare = (!is.na(compare1)))

-output

# A tibble: 7 x 3
  serial   day compare
   <dbl> <dbl> <lgl>  
1      1     1 TRUE   
2      2     0 TRUE   
3      3     1 TRUE   
4      4     0 TRUE   
5      5     0 TRUE   
6      5     1 FALSE  
7      7     1 FALSE

Or with a faster and efficient data.table

library(data.table)
setDT(df2)[, compare := FALSE][setDT(df1), compare := TRUE, on = .(serial, day)]

score 3 · Accepted Answer · answered Jun 30 '21 at 10:27

One way would be to create a unique key combining the two columns and use %in% to find if the key is present in another dataset.

A base R option -

df2$compare <- do.call(paste, df2) %in% do.call(paste, df1)
df2

# A tibble: 7 x 3
#  serial   day compare
#   <dbl> <dbl> <lgl>  
#1      1     1 TRUE   
#2      2     0 TRUE   
#3      3     1 TRUE   
#4      4     0 TRUE   
#5      5     0 TRUE   
#6      5     1 FALSE  
#7      7     1 FALSE

If there are more columns in your data apart from serial and day use the below code.

cols <- c('serial', 'day')
df2$compare <- do.call(paste, df2[cols]) %in% do.call(paste, df1[cols])

score 3 · Answer 3 · answered Jun 30 '21 at 17:04

A base R option

transform(
    merge(cbind(df1, compare = TRUE), df2, all = TRUE),
    compare = !is.na(compare)
)

gives

  serial day compare
1      1   1    TRUE
2      2   0    TRUE
3      3   1    TRUE
4      4   0    TRUE
5      5   0    TRUE
6      5   1   FALSE
7      7   1   FALSE

Compare and identify the missing rows

3 Answers3