Join data but ignore missing values

Question

I am having some trouble with joining data frames with dplyr, where I would like to ignore the NAs.

The data that I have is quite big, but a simplified version looks like:

id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)

df <- data.frame(A, B, C)

     id    A    B    C
1    id1   E    T    NA
2    id2   F    NA   T
3    id3   G    N    U
4    id4   NA   T    NA

I have an entry that I would like to match with df, which is e.g.:

df2 <- data.frame(A = "E", B = "T", C = "M")

    A    B    C
1   E    T    M

As a result I would like to obtain all rows from df that match with df2, but the NAs should be ignored. So the result should look like this:

     id    A    B    C
1    id1   E    T    NA
2    id4   NA   T    NA

I was trying to do this with semi_join, but it did not work so far:

result <- df %>%
  group_by(n = seq(n())) %>%
  do(modify_if(., is.na, ~NULL) %>%
       semi_join(df2, by = c("A", "B", "C"))) %>%
  ungroup %>%
  select(-n)

Which results in:

Error: `by` can't contain join column `C` which is missing from LHS
Call `rlang::last_error()` to see a backtrace

Who knows the answer?

Can you explain your desired results a bit more? Why would row `E T NA` and `NA T NA` be returned? — JasonAizkalns, Mar 29 '19 at 18:59
As others have said, it looks like your example is a bit messed up. In any case, `semi_join` is not the answer as it matches in the same way that the other dplyr joins work. You'll probably want to just make every combination (cartesian product) and then filter on at least one of the pairs of columns or no mismatches. — svenhalvorson, Mar 29 '19 at 19:03
@svenhalvorson sorry I do not get what you mean, could you provide a link or some more information? — PieterD, Mar 29 '19 at 19:11
There is almost certainly a better method than I have in mind but I would start by renaming all the columns in one data frame to A1, B2, C2... Then make a column called 'dummy' in each data frame that is the value 1. Merge them on that variable, make variables to check if `A == A1` ect. Finally, use an `apply` statement or something like that to check if any values in the row are `FALSE` and drop them. — svenhalvorson, Mar 29 '19 at 19:14
It looks like you are trying to capture all matching rows that might match df2. Where NA is treated as unknown. — Adam Sampson, Mar 29 '19 at 19:15

score 1 · Accepted Answer · answered Mar 29 '19 at 19:17

Here's a solution with a mix of tidyverse and base R. I think this is pretty clear, but I'd be interested in a pure tidyverse implementation that isn't completely contrived.

The idea is to first expand all entries in df and df2 and then filter through all the columns using a loop.

The data:

id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)

df <- data.frame(id, A, B, C, stringsAsFactors = F) # Make sure to use strings not factors
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = F)

Code:

library(tidyr)
results <- crossing(df, df2)
select_columns <- c("A", "B", "C")
for(col in select_columns) {
  keep <- is.na(results[[col]]) | results[[col]] == results[[paste0(col, 1)]]
  results <- results[keep,, drop=F]
}
results <- results %>% dplyr::select(id, A:C) %>% distinct
results

   id    A B    C
1 id1    E T <NA>
2 id4 <NA> T <NA>

How would you do it if the data.frames were too big, so that the crossing() could not be done. Thanks. — TOMC, Feb 07 '23 at 16:52
You could break up df into small chunks, do the crossing in a loop and combine the results at the end. — thc, Feb 09 '23 at 18:27

score 1 · Answer 2 · answered Mar 29 '19 at 19:17

1

If you only need to do this for a single set of values this is probably the most straightforward approach:

d[A %in% c("E",NA) & B %in%c("T",NA) & C %in% c("M",NA),]

answered Mar 29 '19 at 19:17

Michael

5,808
4
30
39

score 0 · Answer 3 · answered Mar 29 '19 at 19:48

Another example using tidyverse and base (dplyr, tidyr, base):

In this I convert your df2 into a dataframe that includes all combinations of values you want to accept ( (E or NA) & (T or NA) & (M or NA) ) and then I do an inner join with this full set. There are other ways to create a dataframe of all possible combinations but this one uses tidyr fairly easily.

library(dplyr)
library(tidyr)

id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)

df <- data.frame(A, B, C, stringsAsFactors = FALSE)

df2 <- data.frame(A = "E", B = "T", C = "M",stringsAsFactors = FALSE)

df2_expanded <- df2 %>%
  rowwise() %>%
  mutate(combinations = list(expand.grid(A = c(A,NA),B = c(B,NA),C = c(C,NA),stringsAsFactors = FALSE))) %>%
  select(-A,-B,-C) %>%
  unnest(combinations)

# A tibble: 8 x 3
#   A     B     C    
# <chr> <chr> <chr>
# 1 E     T     M    
# 2 NA    T     M    
# 3 E     NA    M    
# 4 NA    NA    M    
# 5 E     T     NA   
# 6 NA    T     NA   
# 7 E     NA    NA   
# 8 NA    NA    NA   

df %>%
  inner_join(df2_expanded)

#      A B    C
# 1    E T <NA>
# 2 <NA> T <NA>

Join data but ignore missing values

3 Answers3