Join data frames based fuzzy matching of strings

Question

library(tidyverse)
library(fuzzyjoin)
df1 <- tibble(col1 = c("Apple Shipping", "Banana Shipping", "FedEX USA Ground",
                       "FedEx USA Commercial", "FedEx International"),
              col2 = 1:5)
#> # A tibble: 5 x 2
#>   col1                  col2
#>   <chr>                <int>
#> 1 Apple Shipping           1
#> 2 Banana Shipping          2
#> 3 FedEX USA Ground         3
#> 4 FedEx USA Commercial     4
#> 5 FedEx International      5

df2 <- tibble(col3 = c("Banana", "FedEX USA"), col4 = c(700, 900))
#> # A tibble: 2 x 2
#>   col3       col4
#>   <chr>     <dbl>
#> 1 Banana      700
#> 2 FedEX USA   900

The two data frames I'm working with are shown above. I'd like to fuzzy join them on col1 and col3 to come up with something similar to what's directly shown below. Basically the rule would be, "If all the text in col3 is in any of col1 count that as a match".

#> # A tibble: 3 x 4
#>   col1                  col2  col3      col4
#>   <chr>                <int>  <chr>    <int>
#> 1 Banana Shipping          2  Banana     700
#> 2 FedEX USA Ground         3  FedEx USA  900
#> 3 FedEx USA Commercial     4  FedEx USA  900

This older SO question appears to offer the solution, but it doesn't quite seem to work in this case and I get the errors shown below:

df1 %>% regex_inner_join(df2, by = c(string = "col3"))
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.

library(stringr)
df1 %>% fuzzy_inner_join(df2, by = c("string" = "col3"), match_fun = str_detect)
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `col` is NULL
#> Run `rlang::last_error()` to see where the error occurred.

How do I perform this fuzzy join with R?

When the [*"older SO question"*](https://stackoverflow.com/questions/32914357/dplyr-inner-join-with-a-partial-string-match) you listed used `by = c(string = "seed")`, it was because `string` was a column name of one of the frames, not because `string` is token or "known thing" to `fuzzyjoin`. — r2evans, Apr 02 '20 at 14:57

Ian Campbell · Accepted Answer · 2020-04-02T15:18:31.623

Perhaps this is what you're looking for?

library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
#  col1              col2 col3       col4
#  <chr>            <int> <chr>     <dbl>
#1 Banana Shipping      2 Banana      700
#2 FedEX USA Ground     3 FedEX USA   900

If you wanted to ignore case, you could define your own str_detect.

my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900

For bonus points you can use agrepl from this question.

You can modify the max.distance = argument and potentially add cost =. See help(agrepl) for more.

my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900
#4 FedEx International      5 FedEX USA   900

Why does your first solution exclude "FedEx USA Commercial", even though it contains the string "FedEx USA"? And in the opposite vein, why does your second solution include FedEx International, even though it does not contain the string "FedEx USA"? Perhaps this needs to be a new question? — Display name, Apr 02 '20 at 15:04
Because it doesn't match `FedEX` with a capital `X` to `FedEx` with a lowercase `x`. Apparently the way to ignore case is with enclosing the pattern in `regex(string, ignore_case = T)` which you could do by again defining a custom function. — Ian Campbell, Apr 02 '20 at 15:12

Join data frames based fuzzy matching of strings

1 Answers1