
I would like to match the strings from my first dataset with all of their closest common matches.

Data looks like:

dataset1:

California 
Texas 
Florida 
New York

dataset2:

Californiia 
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york

desired result is:

col_1                col_2              col_3            col4
California           Californiia        callifoornia
Texas                T3xas              texas            Te xas
Florida              folrida            Fl0 rida
New York             New york           new york

The question is:

  • How do I search for common strings between the first dataset and the second dataset, and generate a list of terms in the second dataset that align with each term in the first?

Thanks in advance.

Tim
  • Define "closest". What has your research into such notions of closeness found that is relevant? How are you supplying it in your program? Once you get a table with columns for correct & fuzzy, do you know how to do the separate step of turning multiple rows into a row with multiple columns? You are really asking 2 questions here. Both are obviously likely FAQs. What have you found on SO about each? What are you able to do? – philipxy Apr 23 '19 at 00:06
  • See `stringdist` package, and `dcast` in `data.table`. There is a way to do this nicely in R, but I don't have time to code this up right now. `stringdist` is relatively easy to use with some basic R chops. – JMT2080AD Apr 23 '19 at 00:21
  • 1
    Lots of relevant info out there at Stackoverflow, e.g: - https://stackoverflow.com/questions/27975705/compare-strings-for-an-approximate-match/27975870 https://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r https://stackoverflow.com/questions/16145064/approximate-string-matching-in-r https://stackoverflow.com/questions/5721883/agrep-only-return-best-matches https://stackoverflow.com/questions/6044112/how-to-measure-similarity-between-strings etc etc – thelatemail Apr 23 '19 at 00:41
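
As the comments suggest, `stringdist` can do the matching step directly. A minimal sketch using `stringdist::amatch()`, which maps each fuzzy string to its nearest canonical term (the lower-casing and the `maxDist = 4` cutoff are assumptions chosen to cover the typos in the sample data, not something from the question):

```r
library(stringdist)

dataset1 <- c("California", "Texas", "Florida", "New York")
dataset2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
              "Fl0 rida", "folrida", "New york", "new york")

# amatch() returns, for each element of dataset2, the index of its closest
# element in dataset1, or NA if nothing is within maxDist
idx <- amatch(tolower(dataset2), tolower(dataset1), maxDist = 4)

# group the fuzzy strings under their matched dataset1 term
matches <- split(dataset2, dataset1[idx])
```

From there, `matches` is a named list (e.g. `matches[["Texas"]]` holds all the Texas variants), which still needs the long-to-wide reshape to get the exact table in the question.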

2 Answers

library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2, 
                       max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  spread(col, states.y)

#Joining by: "states"
## A tibble: 4 x 4
## Groups:   col_1 [4]
#  col_1      col_2       col_3        col_4
#  <chr>      <chr>       <chr>        <chr>
#1 California Californiia callifoornia NA   
#2 Florida    Fl0 rida    folrida      NA   
#3 New York   New york    new york     NA   
#4 Texas      T3xas       Te xas       texas

data:

dataset1 <- data.frame(states = c("California",
                                "Texas",
                                "Florida",
                                "New York"), 
                       stringsAsFactors = F)

dataset2 <- data.frame(stringsAsFactors = F,
  states = c(
    "Californiia",
    "callifoornia",
    "T3xas",
    "Te xas",
    "texas",
    "Fl0 rida",
    "folrida",
    "New york",
    "new york"
  )
)
Jon Spring

I read a bit about stringdist and came up with this. It's a workaround, but I like it; it can definitely be improved:

library(stringdist)
library(janitor)

ds1a <- read.csv('dataset1')  # each file assumed to have one column, 'name'
ds2a <- read.csv('dataset2')

# distance from every dataset2 string (rows) to every dataset1 string (columns)
df <- data.frame(stringdistmatrix(ds2a$name, ds1a$name, useNames = T))

# go through this df, and every cell that's < 4, replace with the column name, otherwise replace with empty string

for (j in 1:ncol(df)) {
      trigger <- df[, j] < 4
      df[trigger, j] <- names(df)[j]
      df[!trigger, j] <- ""
}


df <- remove_constant(df)

write.csv(df, file="~/Desktop/df.csv")
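
It's also possible to go straight from the distance matrix to the wide table in the question. A sketch under the same assumptions as the loop above (cutoff of 4, empty strings as padding; object names are made up for illustration):

```r
library(stringdist)

ds1 <- c("California", "Texas", "Florida", "New York")
ds2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
         "Fl0 rida", "folrida", "New york", "new york")

# rows = dataset2 strings, columns = dataset1 strings
d <- stringdistmatrix(ds2, ds1, useNames = TRUE)

# for each dataset1 term, collect the dataset2 strings within distance 4
hits <- apply(d, 2, function(col) names(col)[col < 4])

# pad the shorter groups with "" so they line up as columns
n <- max(lengths(hits))
wide <- t(sapply(hits, function(x) c(x, rep("", n - length(x)))))
```

`wide` ends up as a character matrix with one row per dataset1 term and its fuzzy matches across the columns, which is essentially the desired result table.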
Tim