
I would like to match the strings from my first dataset with all of their closest common matches.

Data looks like:

dataset1:

California 
Texas 
Florida 
New York

dataset2:

Californiia 
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york

desired result is:

col_1                col_2              col_3            col4
California           Californiia        callifoornia
Texas                T3xas              texas            Te xas
Florida              folrida            Fl0 rida
New York             New york           new york

The question is:

  • How do I search for common strings between the first dataset and the second dataset, and generate a list of terms in the second dataset that align with each term in the first?

Thanks in advance.

Tim
  • Define "closest". What has your research into such notions of closeness found that is relevant? How are you supplying it in your program? Once you get a table with columns for correct & fuzzy, do you know how to do the separate step of turning multiple rows into a row with multiple columns? You are really asking 2 questions here. Both are obviously likely FAQs. What have you found on SO about each? What are you able to do? – philipxy Apr 23 '19 at 00:06
  • See `stringdist` package, and `dcast` in `data.table`. There is a way to do this nicely in R, but I don't have time to code this up right now. `stringdist` is relatively easy to use with some basic R chops. – JMT2080AD Apr 23 '19 at 00:21
  • 1
    Lots of relevant info out there at Stackoverflow, e.g: - https://stackoverflow.com/questions/27975705/compare-strings-for-an-approximate-match/27975870 https://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r https://stackoverflow.com/questions/16145064/approximate-string-matching-in-r https://stackoverflow.com/questions/5721883/agrep-only-return-best-matches https://stackoverflow.com/questions/6044112/how-to-measure-similarity-between-strings etc etc – thelatemail Apr 23 '19 at 00:41
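
As the comments suggest, `stringdist` can do the matching step directly. A minimal sketch using `stringdist::amatch()`, which maps each fuzzy string to its nearest canonical term (the lower-casing and the `maxDist = 4` cutoff are assumptions chosen to cover the typos in the sample data, not something from the question):

```r
library(stringdist)

dataset1 <- c("California", "Texas", "Florida", "New York")
dataset2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
              "Fl0 rida", "folrida", "New york", "new york")

# amatch() returns, for each element of dataset2, the index of its closest
# element in dataset1, or NA if nothing is within maxDist
idx <- amatch(tolower(dataset2), tolower(dataset1), maxDist = 4)

# group the fuzzy strings under their matched dataset1 term
matches <- split(dataset2, dataset1[idx])
```

From there, `matches` is a named list (e.g. `matches[["Texas"]]` holds all the Texas variants), which still needs the long-to-wide reshape to get the exact table in the question.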

2 Answers

library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2, 
                       max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  spread(col, states.y)

#Joining by: "states"
## A tibble: 4 x 4
## Groups:   col_1 [4]
#  col_1      col_2       col_3        col_4
#  <chr>      <chr>       <chr>        <chr>
#1 California Californiia callifoornia NA   
#2 Florida    Fl0 rida    folrida      NA   
#3 New York   New york    new york     NA   
#4 Texas      T3xas       Te xas       texas

data:

dataset1 <- data.frame(states = c("California",
                                "Texas",
                                "Florida",
                                "New York"), 
                       stringsAsFactors = F)

dataset2 <- data.frame(stringsAsFactors = F,
  states = c(
    "Californiia",
    "callifoornia",
    "T3xas",
    "Te xas",
    "texas",
    "Fl0 rida",
    "folrida",
    "New york",
    "new york"
  )
)
Jon Spring

I read a bit about stringdist and came up with this. It's a workaround, but I like it; it can definitely be improved:

library(stringdist)
library(janitor)

ds1a <- read.csv('dataset1')  # each file assumed to have one column, 'name'
ds2a <- read.csv('dataset2')

# distance from every dataset2 string (rows) to every dataset1 string (columns)
df <- data.frame(stringdistmatrix(ds2a$name, ds1a$name, useNames = T))

# go through this df, and every cell that's < 4, replace with the column name, otherwise replace with empty string

for (j in 1:ncol(df)) {
      trigger <- df[, j] < 4
      df[trigger, j] <- names(df)[j]
      df[!trigger, j] <- ""
}


df <- remove_constant(df)

write.csv(df, file="~/Desktop/df.csv")
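
It's also possible to go straight from the distance matrix to the wide table in the question. A sketch under the same assumptions as the loop above (cutoff of 4, empty strings as padding; object names are made up for illustration):

```r
library(stringdist)

ds1 <- c("California", "Texas", "Florida", "New York")
ds2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
         "Fl0 rida", "folrida", "New york", "new york")

# rows = dataset2 strings, columns = dataset1 strings
d <- stringdistmatrix(ds2, ds1, useNames = TRUE)

# for each dataset1 term, collect the dataset2 strings within distance 4
hits <- apply(d, 2, function(col) names(col)[col < 4])

# pad the shorter groups with "" so they line up as columns
n <- max(lengths(hits))
wide <- t(sapply(hits, function(x) c(x, rep("", n - length(x)))))
```

`wide` ends up as a character matrix with one row per dataset1 term and its fuzzy matches across the columns, which is essentially the desired result table.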
Tim