How to check if values in first dataframe are contained or match values in another dataframe

Question

I am using R to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge or join. I will introduce my dataframes (dput() at end):

My first dataframe is df1. It contains the variable name which I want to contrast with other variable in a second dataframe. It looks like this:

df1
                   name
1            JUAN GIRON
2            GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4          MARCO TORRES
5   JUAN PABLO GONZALEZ
6            IRMA GOMEZ

The second dataframe is df2. It also contains a variable name which will be used to the contrast with name from df1. It looks like this (In a real situation df2 can be very large with more than 1000 rows):

df2
                      name val
1            JUANA MARQUEZ   1
2         FERNANDO ELIZAGA   2
3               IRMA GOMEZ   3
4           PABLO GONZALEZ   4
5               GINA LUCIO   5
6              MARK TORRES   6
7           LETICIA BLACIO   7
8 JUAN PABLO GIRON BELTRAN   8

I am looking for a way to check if every row of df1 for name variable is contained or match with any value for name in df2. For example, the value JUAN GIRON after checking with name from df2 should return, give a value of yes because it is contained in the string JUAN PABLO GIRON BELTRAN from df2. The same case would apply for the other values. In the end I would like to have something like this:

df3
                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes

How can I reach that result? I have tried with grepl() concatenating the strings using | but it is not working because some values are returning a yes match when there is not match.

Also, as data can be large, I would like to have a solution with dplyr because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!

Data is next:

#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA", 
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA, 
-6L), class = "data.frame")

#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA", 
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES", 
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA, 
-8L), class = "data.frame")

Is the order important? For example, would you expect a yes for `GOMEZ IRMA`? — Andrew Gustar, Apr 11 '22 at 14:43
How specific do the matches need to be? For example, your output appears to have "JUAN PABLO GONZALEZ" matching "PABLO GONZALEZ," but should it also match "PABLO GONZALEZ BELTRAN" if that name is in the list? This is a place where humans are often better at ID'ing what should (and should not) count as a match. It is often difficult to translate our intuitive understanding of names into clear code for computers to follow — Mark Peterson, Apr 11 '22 at 17:24
@MarkPeterson Many thanks for your great comment. What you said is right, as we work with latino names. In the example you state, `JUAN PABLO GONZALEZ` should also match `PABLO GONZALEZ BELTRAN`. These names (`df2`) can have two given names and two family names, or one given name and two family names. So the code must be able to look up and check that the string in `df1` is in any of the names in the variable name of `df2`. Hope that is clearer, and if you want to post a solution you are welcome. The issue is that `df2` can have more than 1k rows so it needs to be fast. — Duck, Apr 11 '22 at 19:22
@Duck That clarification does complicate things a bit. Is it possible to split the data frames into two columns to separate out the given and family names? I'm assuming you want to call a match anytime both the given and family names have a match (and handle them the same). Further, if a `df1` given name is "JUAN PABLO" you want it to return a match for either "JUAN" or "PABLO" but presumably not "JUAN CARLOS," but a `df1` given name of "JUAN" should match any `df2` given name that includes "JUAN." Am I understanding correctly? — Mark Peterson, Apr 11 '22 at 19:40
As a further clarification of the issue: whether "JUAN PABLO GONZALEZ" should return a match to "JUAN XXX GONZALEZ" appears to depend on whether "XXX" is a given name (then no) or a family name (then yes). If that is the case, and the given and family names cannot be split into separate columns, then the code for this question is going to require parsing whether "XXX" is a given or family name. — Mark Peterson, Apr 11 '22 at 19:59
@MarkPeterson You are right, data on df1 and df2 could be splitted and when we have four words, we know two first are given and two second are family. We can also have only three words, where first is given and the two second are family. With that, I think it could be possible to separate. For more details, `JUAN PABLO GONZALEZ` has two given names and only one family name. In that case the match should be yes with `JUAN XXX GONZALEZ` if XXX equals `PABLO` or also if we remove `JUAN` and only keep `XXX GONZALEZ` then XXX should be `JUAN` or `PABLO` to get yes. — Duck, Apr 11 '22 at 21:15
@MarkPeterson The match should evaluate if any of the given or the family names can be found. Example, if in `df1` we have `JULIO MARK TORRES` and we check with all `df2` names, the value can be found in `MARK TORRES` from `df2` so we get yes. In this case if we split given `JULIO MARK` and family `TORRE` after the match, `MARK` is inside `MARK TORRES` and `TORRES` is also there but `JULIO` is not, so as the value has two matches we get yes. Let me know if that was clear enough and many thanks for your help! — Duck, Apr 11 '22 at 21:20
@Duck If there are 2 words `A B` then B is family, with 4 words `A B C D` then C and D is family. But when there are 3 words `A B C` C is family but how to know if B is family or given? — GKi, Apr 19 '22 at 07:08

ThomasIsCoding · Answer 1 · 2022-04-11T11:24:03.873

5

Perhaps we can do like this

df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)])

which gives

                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes

edited Apr 11 '22 at 11:24

answered Apr 11 '22 at 09:40

ThomasIsCoding

96,636
9
24
81

1

This is a nice solution as always yours are sir. Lets wait if somebody can release a new approach otherwise I will award the bounty to you! Many thanks! – Duck Apr 11 '22 at 15:14
1

@Duck Thanks! I would say my answer is not efficient, and I believe there must be a much better solution than mine, e.g., `fuzzyjoin` or something else (but I am not faminlar with those). So, I am also looking forward to a more efficient new approach. – ThomasIsCoding Apr 11 '22 at 20:47
1

many thanks this is a complex issue for my student and that is why I bountied the question with high score to solve it! – Duck Apr 11 '22 at 21:21

Skaqqs · Answer 2 · 2022-04-08T17:10:00.717

Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.

# Input
a <- strsplit(df2$name, " ")
# Output
b <- c()

# Define regex pattern
for(i in 1:length(a)){
  if(length(a[[i]]) == 3){
    temp <- paste0(
        a[[i]][1], " ", a[[i]][2], "|",
        a[[i]][1], " ", a[[i]][3], "|",
        a[[i]][2], " ", a[[i]][3])
  } else if(length(a[[i]] == 2)){
    temp <- paste(a[[i]], collapse = " ")
  } else {
    stop("Length of split name was not 2 or 3")
  }
  b <- c(b, temp)
}

df1$val <- grepl(paste(b, collapse = "|"), df1$name)

Alternatively, after defining b using the loop above:

library(dplyr)
patt <- paste(b, collapse = "|")
df1 %>%
    mutate(val = grepl(patt, name))

Result:

> df1
                   name    val
1            JUAN GIRON    TRUE
2            GINA OLEAS   FALSE
3 JUAN FERNANDO ELIZAGA    TRUE
4          MARCO TORRES   FALSE
5   JUAN PABLO GONZALEZ    TRUE
6            IRMA GOMEZ    TRUE

This is a nice solution. Lets keep for others that could use `dplyr` +1 — user007, Apr 08 '22 at 16:28
Oh sorry I can upvote unless I have 15 rep points, but I will do ! — user007, Apr 08 '22 at 16:33
No worries. I'm not sure if there's any difference in performance (probably not), but you could apply this regex string in dplyr like this: `library(dplyr); patt <- paste(b, collapse = "|"); df1 %>% mutate(val = grepl(patt, name))` — Skaqqs, Apr 08 '22 at 16:44

shs · Answer 3 · 2022-04-12T14:11:42.787

The answer by ThomasIsCoding is great. But using outer() is quite memory consuming and not parallelizable. The following solution uses nested map()'s. Also, the furrr package is used to parallelize the outer map().

A benchmark with a much larger df2 shows that both nesting and parallelization yield a substantial speed-up for about double the speed in total.

Data and Packages

library(tidyverse)
library(furrr)

#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA", 
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA, 
-6L), class = "data.frame")

#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA", 
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES", 
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA, 
-8L), class = "data.frame")

Demo on small data set

plan(multisession, workers = 8) # 8 for my quad-core with hyperthreading

n2 <- df2$name |> 
  str_split("\\s+")

df1 |> 
  mutate(val = name |> 
           str_split("\\s+") |> 
           future_map_lgl(\(n1e) map_lgl(n2, 
                                 \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                 ) |> any()
                      ) |> 
           factor(labels = c("no", "yes"))
         )
#>                    name val
#> 1            JUAN GIRON yes
#> 2            GINA OLEAS  no
#> 3 JUAN FERNANDO ELIZAGA yes
#> 4          MARCO TORRES  no
#> 5   JUAN PABLO GONZALEZ yes
#> 6            IRMA GOMEZ yes

Benchmark Code

# Make df2 much larger
df2xl <- df2 |> 
  list() |> 
  rep(10000) |> 
  bind_rows()

bench::mark(
  Thomas = df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2xl$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)]),
  nested_map = {
    n2 <- df2xl$name |> 
      str_split("\\s+")
    
    df1 |>
      mutate(val = name |>
               str_split("\\s+") |>
               map_lgl(\(n1e) map_lgl(n2,
                                     \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                     ) |> any()
                          ) |>
               factor(labels = c("no", "yes"))
             )
    },
  parallel_nested_map = {
    n2 <- df2xl$name |> 
      str_split("\\s+")
    
    df1 |> 
      mutate(val = name |> 
               str_split("\\s+") |> 
               future_map_lgl(\(n1e) map_lgl(n2, 
                                     \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                     ) |> any()
                          ) |> 
               factor(labels = c("no", "yes"))
             )
    },
  check = F,
  min_iterations = 10,
  filter_gc = F
)

Benchmark Result

#> # A tibble: 3 × 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Thomas                 2.48s    2.53s     0.396   21.51MB     8.31
#> 2 nested_map             1.72s    1.73s     0.563    2.46MB     8.28
#> 3 parallel_nested_map    1.07s    1.22s     0.827    2.86MB     2.56

^{Created on 2022-04-12 by the reprex package (v2.0.1)}

GKi · Answer 4 · 2022-04-25T13:19:18.167

In case the order of the names does not change .* could be inserted between the names and then use grep in both directions (df1$names in df2$names and df2$names in df1$names) and combine them with or |.

transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))])
#                   name val
#1            JUAN GIRON yes
#2            GINA OLEAS  no
#3 JUAN FERNANDO ELIZAGA yes
#4          MARCO TORRES  no
#5   JUAN PABLO GONZALEZ yes
#6            IRMA GOMEZ yes

In case no partly match of the names are allowed surround the names with \b.

transform(df1, val = c("no", "yes")[1+(sapply(
                        gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
                        \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
   Reduce(\(y, x) y | grepl(x, df1$name),
     gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))])

In case the order can chaange make a positive look ahead by placing the name in (?=.*NAMME) or also surround the name with \b (?=.*\\bNAME\\b).

transform(df1, val = c("no", "yes")[1+(sapply(
            gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
         \(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
         Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
         gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] )

Its also possible to use agrepl and allow deletions which will be similar to the version assuming that the order of the names does not change and part matches of the name are allowed.

transform(df1, val = c("no", "yes")[1+(
  sapply(df1$name, \(x) any(agrepl(x, df2$name,
       list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
  Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
       deletions=99, substitutions=0)), df2$name, FALSE))])

Another option can be the usage of look up tables:

s1 <- strsplit(df1$name, " ", TRUE)
lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
s2 <- strsplit(df2$name, " ", TRUE)
lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
`[<-`(sapply(s1, \(x) any(Reduce(intersect, mget(x, lup2, ifnotfound =
    list(NULL))))), unlist(lapply(s2, \(x) Reduce(intersect, mget(x, lup1,
       ifnotfound = list(NULL))))), TRUE)
#[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE

Benchmark:

Its also possible to limit the comparisons only to those which didn't have a match (GKi1b) where maybe the usage of indices using which instead of using the logical vector twice could fuhrer improve and making an exit of the loop in case all have a hit. In case the names are not unique use unique on the names.

library(dplyr)
bench::mark(
  Thomas = df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)]),
  GKi1 = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))]),
  GKi1b = transform(df1, val = c("no", "yes")[1 +
    Reduce(\(i, x) `[<-`(i, !i, grepl(x, df1$name[!i])), gsub(" +", ".*",
      df2$name), sapply(gsub(" +", ".*", df1$name), \(x) any(grep(x, df2$name)),
        USE.NAMES = FALSE)) ]),
  GKi1c = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       grepl(paste(gsub(" +", ".*", df2$name), collapse = "|"), df1$name) )]),
  GKi2 = transform(df1, val = c("no", "yes")[1+(sapply(
                        gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
                        \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
   Reduce(\(y, x) y | grepl(x, df1$name),
   gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))]),
  GKi3 = transform(df1, val = c("no", "yes")[1+(sapply(
            gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
         \(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
         Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
         gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] ),
  GKi4 = transform(df1, val = c("no", "yes")[1+(
  sapply(df1$name, \(x) any(agrepl(x, df2$name,
       list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
  Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
                                              deletions=99, substitutions=0)), df2$name, FALSE))]),
  GKi5 = {
    s1 <- strsplit(df1$name, " ", TRUE)
    lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
    s2 <- strsplit(df2$name, " ", TRUE)
    lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
    transform(df1, val = c("no", "yes")[1+`[<-`(sapply(s1, \(x) any(Reduce(base::intersect, mget(x, lup2, ifnotfound =
    list(NULL))))), unlist(lapply(s2, \(x) Reduce(base::intersect, mget(x, lup1,
       ifnotfound = list(NULL))))), TRUE)])
  }
)

  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Thomas        863µs  894µs      919.    4.08KB    21.4    429    10      467ms
2 GKi1          211µs  218µs     3770.        0B    14.6   1803     7      478ms
3 GKi1b         211µs  226µs     3020.        0B    14.6   1448     7      479ms
4 GKi1c         183µs  200µs     3424.        0B    10.3   1667     5      487ms
5 GKi2          262µs  275µs     2755.        0B    12.4   1336     6      485ms
6 GKi3          391µs  409µs     2010.        0B     9.19   875     4      435ms
7 GKi4          374µs  386µs     2295.        0B    16.5   1110     8      484ms
8 GKi5          272µs  285µs     2570.    2.82KB    21.1   1220    10      475ms

All variants are more than 2 times faster than ThomasIsCoding using one CPU-core.

score 2 · Answer 5 · answered Apr 14 '22 at 02:10

library(data.table)
library(stringi)
library(purrr)

setDT(df1)
df1[,val := fifelse(map_lgl(stri_replace_all_fixed(name, " ", "|"), ~any(stri_count_regex(..df2$name, .x) >= 2)), "yes", "no")][]
#>                     name val
#> 1:            JUAN GIRON yes
#> 2:            GINA OLEAS  no
#> 3: JUAN FERNANDO ELIZAGA yes
#> 4:          MARCO TORRES  no
#> 5:   JUAN PABLO GONZALEZ yes
#> 6:            IRMA GOMEZ yes

This creates a vector of partial match counts (needles) per name in df2 (haystack).

[[1]]
[1] 1 0 0 0 0 0 0 2

[[2]]
[1] 0 0 0 0 1 0 0 0

[[3]]
[1] 1 2 0 0 0 0 0 1

[[4]]
[1] 0 0 0 0 0 1 0 0

[[5]]
[1] 1 0 0 2 0 0 0 2

[[6]]
[1] 0 0 2 0 0 0 0 0

I've set the number of minimum matches that result in a "yes" to 2, but this number can be tweaked (given we don't know which part is a family name or given name).

Mark Peterson · Answer 6 · 2022-04-14T18:44:12.573

I'm going for robust here, rather than speed or elegance. I am certain that it could be cleaned up and sped up, but this addresses all of the discussed concerns including the added specification in the comments about handling family and given name matches separately.

First, here is updated data that has the family and given names split:

df1_split <-
  tibble(
    Given = c("JUAN", "GINA", "JUAN FERNANDO"
              , "MARCO", "JUAN PABLO", "IRMA"
              , "JUAN", "JUAN CARLOS")
    , Family = c("GIRON", "OLEAS", "ELIZAGA"
                 , "TORRES", "GONZALEZ", "GOMEZ"
                 , "GOMEZ", "MARTINEZ")
  )
  
df2_split <-
  tibble(
    Given = c("JUANA", "FERNANDO", 
              "IRMA", "PABLO", "GINA", "MARK", 
              "LETICIA", "JUAN PABLO"
              , "FERNANDO CARLOS"
              , "JUAN FERNANDO")
    , Family = c("MARQUEZ", "ELIZAGA", 
                 "GOMEZ", "GONZALEZ", "LUCIO", "TORRES", 
                 "BLACIO", "GIRON BELTRAN"
                 , "MARTINEZ"
                 , "ELIZAGA")
  )

Note that I added a couple of names to highlight some of the problems with the doubled names.

Then, this function will check any name set. It handles the names differently depending on whether they are single (e.g. "Juan") or dual-named (e.g., "Juan Carlos"). For single names to be checked, it just looks if they are present at all. For dual-named names, it checks that they are in the same order if the name to check against has two names as well or if either name is present if the name to check against only has one name.

check_names <- function(to_check, against){
  split_against <-
    str_split(against, " ")
  
  str_split(to_check, " ") %>%
    lapply(function(this_name){
      if(length(this_name) == 1){
        to_ret <-
          sapply(split_against, function(this_against){
            any(this_name == this_against)
          }) %>%
          which
      } else if(length(this_name) == 2){
        to_ret <-
          sapply(split_against, function(this_against){
            if(length(this_against) ==  2){
              return(all(this_against == this_name))
            } else if(length(this_against) ==  1){
              return(any(this_against == this_name))
            } else{
              stop("Names (against) cannot have three words: "
                   , this_against)
            }
          }) %>%
          which
      } else{
        stop("Names (to_check) cannot have three words: "
             , this_name)
      }
    })
  
}

Then, we wrap that function together to pass the family and given names separately. The results are then checked to see if there are any indices that are matches for both the family and given names.

check_both_simple <- function(to_check_given, to_check_family, against_given, against_family){
  checked_given <- check_names(to_check_given, against_given)
  checked_family <- check_names(to_check_family, against_family)
  
  valid_matches <- lapply(1:length(checked_given), function(idx){
    checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
  })
  
  to_return <-
    ifelse(sapply(valid_matches, length) > 0
           , "yes"
           , "no")
  
  return(to_return)
  
}

We can then use this in a call from mutate to add the column:

df1_split %>%
  mutate(Match = check_both_simple(Given, Family
                                   , df2_split$Given
                                   , df2_split$Family))

Returns:

# A tibble: 8 × 3
  Given         Family   Match
  <chr>         <chr>    <chr>
1 JUAN          GIRON    yes  
2 GINA          OLEAS    no   
3 JUAN FERNANDO ELIZAGA  yes  
4 MARCO         TORRES   no   
5 JUAN PABLO    GONZALEZ yes  
6 IRMA          GOMEZ    yes  
7 JUAN          GOMEZ    no   
8 JUAN CARLOS   MARTINEZ no

And should handle all of the odd edge cases discussed in the comments as well.

The nice thing about this approach (and part of why I built it this robustly in the first place), is that you can also set the function to return the matching indices.

check_both_idx <- function(to_check_given, to_check_family, against_given, against_family){
  checked_given <- check_names(to_check_given, against_given)
  checked_family <- check_names(to_check_family, against_family)
  
  valid_matches <- lapply(1:length(checked_given), function(idx){
    checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
  })
  
  return(valid_matches)
  
}

Using this, you can actually pull the matches that are found and manually inspect them. This would allow you to identify any additional edge cases where you didn't agree with a found match or where one match may be markedly better than another.

df1_split %>%
  mutate(Match_idx = check_both_idx(Given, Family
                                    , df2_split$Given
                                    , df2_split$Family)
         , Matches = sapply(Match_idx, function(this_idx_set){
           paste(df2_split$Given[this_idx_set]
                 , df2_split$Family[this_idx_set]) %>%
             paste(collapse = "; ")
         })
         , Match = ifelse(sapply(Match_idx, length) > 0
                          , "yes"
                          , "no")
         , Match_idx = sapply(Match_idx, paste, collapse = "; ")
         )

Returns:

# A tibble: 8 × 5
  Given         Family   Match_idx Matches                                   Match
  <chr>         <chr>    <chr>     <chr>                                     <chr>
1 JUAN          GIRON    "8"       "JUAN PABLO GIRON BELTRAN"                yes  
2 GINA          OLEAS    ""        ""                                        no   
3 JUAN FERNANDO ELIZAGA  "2; 10"   "FERNANDO ELIZAGA; JUAN FERNANDO ELIZAGA" yes  
4 MARCO         TORRES   ""        ""                                        no   
5 JUAN PABLO    GONZALEZ "4"       "PABLO GONZALEZ"                          yes  
6 IRMA          GOMEZ    "3"       "IRMA GOMEZ"                              yes  
7 JUAN          GOMEZ    ""        ""                                        no   
8 JUAN CARLOS   MARTINEZ ""        ""                                        no

Editing to add: The following two sets present some tricky examples that would currently be decided incorrectly by the other answers. These examples arose from the discussion in the comments to clarify what should match.

tricky_1 <-
  tibble(
    Given = c("JUAN", "JUANITA GINA"
              , "JUAN CARLO", "GOMEZ")
    , Family = c("GIRON BELTRAN", "OLEAS"
                 , "MARTINEZ", "IRMA")
  )


tricky_2 <-
  tibble(
    Given = c("JUAN PABLO", "GINA"
              , "CARLO JUAN", "IRMA")
    , Family = c("GIRON", "OLEAS GIRON"
                 , "MARTINEZ", "GOMEZ")
  )

We can view them side by side like so:

bind_cols(
  tricky_1 %>%
    setNames(paste0("toCheck_", names(.)))
  , tricky_2 %>%
    setNames(paste0("against_", names(.)))
) %>%
  mutate(shouldMatch = c("yes", "yes", "no", "no"))

returns:

# A tibble: 4 × 5
  toCheck_Given toCheck_Family against_Given against_Family shouldMatch
  <chr>         <chr>          <chr>         <chr>          <chr>      
1 JUAN          GIRON BELTRAN  JUAN PABLO    GIRON          yes        
2 JUANITA GINA  OLEAS          GINA          OLEAS GIRON    yes        
3 JUAN CARLO    MARTINEZ       CARLO JUAN    MARTINEZ       no         
4 GOMEZ         IRMA           IRMA          GOMEZ          no

The first two should match because the family and given names each have a 1-2 match in each direction. However, that means that neither the name to check or the name to check against is completely contained in the other. The third shares all of the components, but I believe that "JUAN CARLO" should not match "CARLO JUAN." The fourth has the family and given names reversed, so shouldn't return a match.

The code from my answer handles these cases:

tricky_1 %>%
  mutate(Match_idx = check_both_idx(Given, Family
                                    , tricky_2$Given
                                    , tricky_2$Family)
         , Matches = sapply(Match_idx, function(this_idx_set){
           paste(tricky_2$Given[this_idx_set]
                 , tricky_2$Family[this_idx_set]) %>%
             paste(collapse = "; ")
         })
         , Match = ifelse(sapply(Match_idx, length) > 0
                          , "yes"
                          , "no")
         , Match_idx = sapply(Match_idx, paste, collapse = "; ")
  )

Returns:

# A tibble: 4 × 5
  Given        Family        Match_idx Matches            Match
  <chr>        <chr>         <chr>     <chr>              <chr>
1 JUAN         GIRON BELTRAN "1"       "JUAN PABLO GIRON" yes  
2 JUANITA GINA OLEAS         "2"       "GINA OLEAS GIRON" yes  
3 JUAN CARLO   MARTINEZ      ""        ""                 no   
4 GOMEZ        IRMA          ""        ""                 no

score 0 · Answer 7 · answered Apr 11 '22 at 13:41

df2_flat <- df2$name |> stringr::str_split(" ") |> purrr::flatten()                                                                                                                                                                                                 

df1 |> 
    mutate(splitnames = stringr::str_split(name, " ")) |> 
    rowwise() |>
    mutate(val = all(splitnames %in% df2_flat)) |>
    select(-splitnames)

Produces the list you specified

score 0 · Answer 8 · answered Apr 17 '22 at 20:47

How I would go about it

library(tidyverse)

df1 %>% 
  mutate(val = sapply(name, \(n) {
    
    result = strsplit(n, " ")[[1]] %>% 
      sapply(., \(sn) { #loop through each name and look in df2
        sum(grepl(sn, df2$name)) #Could be rewritten to a for-loop to break at the first missed name
      })
    
    if(0 %in% result) {
      return("no") #at least one name was not found
    } else {
      return("yes") #defaults to yes, since this can only be reached if no names have been missed (i.e. all have been found)
    }
  }))

score 0 · Answer 9 · answered Apr 17 '22 at 20:58

Adding another solution that I didn't see above:

Basically use either left_join(), anti_join and full_join to match the values in the dataframe

#see which values in df2 match df1
left_join(df1,df2,by="name")

#see which values don't match in either df1 and df2
anti_join(df1,df2,by="name")

How to check if values in first dataframe are contained or match values in another dataframe

9 Answers9

Data and Packages

Demo on small data set

Benchmark Code

Benchmark Result

Linked