42

I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! Assume that cleaning has already been applied and that there may be typos and insertions.

So far agrep is the closest tool I've found that might work. I can use Levenshtein distances with agrep, which measure the number of deletions, insertions, and substitutions between two strings. agrep will return the string with the smallest distance (the most similar).

However, I've been having trouble going from matching a single value to applying this across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there's got to be an easier way.

See the following code:

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))

for (i in 1:nrow(a)) {
    # max.distance (not max) takes the costs; agrep() can return zero or
    # several matches, so take the first with [1]
    a$x[i] <- agrep(a$name[i], b$name, value = TRUE,
                    max.distance = list(del = 0.2, ins = 0.3, sub = 0.4))[1]
    a$Y[i] <- agrep(a$name[i], b$name, value = FALSE,
                    max.distance = list(del = 0.2, ins = 0.3, sub = 0.4))[1]
}
pnuts
A L
  • Based on everyone's feedback and some poking around on my end, I created a function that solves my exact problem. Code can be found here: https://github.com/Adamishere/Fuzzymatching/blob/master/Fuzzy%20String%20Match%20FunctionV1.R – A L Mar 01 '17 at 15:59
  • Thank you for this function, it is quite useful. However, I am not able to pass my columns in string1, string2 and id2. My data is in a data.table, so I'm not sure how I should pass them when calling the function. Could you please suggest? Sorry if my question is very basic; I have started learning R and still have a long way to go – user1412 Mar 11 '17 at 17:34
  • I would just use data.frame(), then once the match is done, convert to data.table() – A L Mar 21 '17 at 20:46
  • the fuzzyjoin package might help - see answer below using fuzzyjoin::stringdist_left_join – Arthur Yip Jun 06 '17 at 04:03
  • how does the function work if there isn't simply a 1 variable data frame? It doesn't work in my case when I have 2 dataframes with multiple columns in each. – jvalenti May 12 '21 at 15:40

7 Answers

32

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example.

As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::slice_min to keep only the best match with the minimum distance. (slice_min replaces the older top_n; if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1) instead.)


a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))

library(fuzzyjoin); library(dplyr);

stringdist_join(a, b, 
                by = "name",
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(name.x) %>%
  slice_min(order_by = dist, n = 1)

#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price     name.y   qty       dist
#>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
#> 1 Ace Co    10    Ace Co.     9 0.04761905
#> 2  Bayes    13 Bayes Inc.    99 0.16666667
#> 3    asd     2       asdf    10 0.08333333
#> 4    Bcy     1 Bayes Inc.    99 0.37777778
#> 5   Baes    15 Bayes Inc.    99 0.20000000
#> 6   Bays     1 Bayes Inc.    99 0.20000000
Arthur Yip
  • I am running your code on a large data set, but my computer is less of a fan of the code than I am: `Error: cannot allocate vector of size 2.6 Gb`. I was thinking: would decreasing the maximum distance improve the allocation problem? Or is there some other way to make it less memory-intensive? – Tom Dec 07 '20 at 09:40
  • Decreasing the maximum distance did not really help. – Tom Dec 07 '20 at 09:50
  • maybe split up your data into several chunks before running – Arthur Yip Dec 08 '20 at 02:31
  • I see you tried to do that and posted a new question and I've reviewed your code - you were on the right track! – Arthur Yip Dec 08 '20 at 05:04
  • @ArthurYip, how would you manipulate the code, if you also had to match a column from data frame a to data frame b? in my case, I want to run your code but I can add another condition while the name of a column from both dataframes must be the same? – Alex Feb 18 '21 at 15:31
  • @Alex With fuzzyjoin, you can set match_fun to `==` or stringdist, and set by = column a, column b. Can get you the syntax in a bit. – Arthur Yip Feb 18 '21 at 20:00
  • @Alex Here are some answers that will help: https://stackoverflow.com/questions/44383510/passing-arguments-into-multiple-match-fun-functions-in-r-fuzzyjoinfuzzy-join https://stackoverflow.com/questions/42793833/r-function-for-a-function-to-be-repeated-based-on-column-values/44383103#44383103 https://stackoverflow.com/questions/56009863/how-to-fuzzy-join-based-on-multiple-columns-and-conditions/64439813#64439813 https://stackoverflow.com/questions/58442426/how-do-i-do-one-fuzzy-and-one-exact-match-in-a-dataframe/64440492#64440492 – Arthur Yip Feb 18 '21 at 22:08
  • @ArthurYip, Thanks so much. my code is:`stringdist_join(a, b, by =list(x = c("name","city"), y = c("last_name","city")), match_fun = list(match_fun_distance,`==`), mode = "left", ignore_case = FALSE, method = "jw", max_dist = 99, distance_col = "dist") %>% group_by(name) %>% slice_min(order_by = dist, n = 1)` and I get this error `Error in stringdist::stringdist(v1[include], v2[include], method = method, : unused argument (match_fun = list(match_fun_distance, `==`)) ` – Alex Feb 19 '21 at 06:15
  • @Alex you have to switch to fuzzy_join to use the list of match_fun's and columns to join by – Arthur Yip Feb 19 '21 at 06:17
  • @ArthurYip, thanks. I modified the code as you mentioned as follows: `fuzzy_join(a, b, by = list(x = c("name","city"), y = c("last_name","city")), match_fun = list(match_fun_stringdist,`==`), mode = "left") %>% group_by(name) %>% slice_min(order_by = dist, n = 1) ` and I used the function of here [https://stackoverflow.com/questions/42793833/r-function-for-a-function-to-be-repeated-based-on-column-values/44383103#44383103] – Alex Feb 19 '21 at 14:47
  • @ArthurYip. I got another error: `Error in order(x) : argument 1 is not a vector In addition: Warning message: data_frame() is deprecated as of tibble 1.1.0. Please use tibble() instead `. would you please advise, thanks again. – Alex Feb 19 '21 at 14:51
  • @Alex the first part without group_by and slice_min should be ok now. it's because now there's no column named dist. now you have name.dist and city.dist, so change that "dist" to "name.dist" – Arthur Yip Feb 19 '21 at 19:37
  • @ArthurYip, thanks! Worked perfectly. One issue though: all `name.dist` are **NA**. Is that the way it works, because the two columns of `names` are equal to each other? – Alex Feb 20 '21 at 18:43
  • @ArthurYip, just wonder if you have any clues about the `NA` – Alex Feb 22 '21 at 06:42
  • are name.dist and city.dist both NA? it's NA for the column using `==` since there's no distance to calculate – Arthur Yip Feb 22 '21 at 09:19
  • @ArthurYip, it's `NA` for the `==` column. Alright, it seems it's all set. Thanks again! – Alex Feb 22 '21 at 16:02
25

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before, I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution, you'd be better off with a dedicated assignment solver (e.g. clue::solve_LSAP).
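For reference, here is a minimal sketch of the optimal one-to-one assignment using clue::solve_LSAP (my suggestion, not part of the original approach; it assumes the clue and stringdist packages are installed, and solve_LSAP requires nrow <= ncol, so the shorter list goes on the rows):

library(stringdist)
library(clue)

a_names <- c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays')
b_names <- c('Ace Co.', 'Bayes Inc.', 'asdf')

# Cost matrix: rows must not outnumber columns for solve_LSAP,
# so put the shorter list (b) on the rows
cost <- stringdistmatrix(b_names, a_names, method = "jw")
assignment <- solve_LSAP(cost)  # Hungarian-style optimal assignment

data.frame(b = b_names, a = a_names[as.integer(assignment)])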

Not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable, 
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
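For reference, a dplyr version of the many-to-one step above (a sketch; slice_min needs dplyr >= 1.0):

library(stringdist)
library(dplyr)

d <- expand.grid(a_name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                 b_name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                 stringsAsFactors = FALSE)
d$dist <- stringdist(d$a_name, d$b_name, method = "jw")

# For each a_name, keep the closest b_name (ties are kept)
d %>% group_by(a_name) %>% slice_min(dist, n = 1) %>% ungroup()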

Edit: use method="jw" to produce desired results. See help("stringdist-package")

C8H10N4O2
  • Thanks! This is very helpful. Although I am curious, in the many-to-one case, the results do not seem correct as they are not returning the best matches, after the first row. – A L Oct 16 '14 at 19:47
  • @Adam Lee depends on how you define "best" matches. See `?stringdist` or `?adist` for more on the default distance metrics. Using either of these functions with default arguments, "Bayes" is one edit closer to "asdf" than it is to "Bayes Inc." – C8H10N4O2 Oct 16 '14 at 20:28
  • Ah I see! Thank you, so it was a matter of the distance metrics used that was causing that. Again, this is very helpful! – A L Oct 17 '14 at 19:07
  • @C8H10N4O2, I was looking for some suggestions on how we can get the data for some specific equivalent column of the match from 2nd dataset. I have posted the question at - http://stackoverflow.com/questions/42749447/r-fuzzy-string-match-to-return-specific-column-based-on-matched-string It would be great help if you can suggest something – user1412 Mar 12 '17 at 17:11
  • This was very helpful - thanks. I found this scales up a lot further if you filter d$dist before calling the greedyAssign function, e.g. `d <- d[d$dist < 0.2,]`. After running the code above (without a filter) for a sample, you can usually pick a crude cutoff point beyond which the proposed matches are unlikely to be useful. – Mike Honey Sep 04 '17 at 01:00
  • @C8H10N4O2 In the expand.grid(), I need to keep my ID columns alongside the fuzzy matches (a_name) and ID from b alongside b_name. Is that possible here? I will really appreciate it. – pyeR_biz Mar 08 '18 at 22:32
3

I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.

install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)

test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8)  # pick your level of cutoff, of course
data.frame(name = a$name, price = a$price, test = test)

> data.frame(name = a$name, price = a$price, test = test)
    name price  test
1 Ace Co    10  TRUE
2  Bayes    13  TRUE
3    asd     2  TRUE
4    Bcy     1 FALSE
5   Baes    15  TRUE
6   Bays     1 FALSE
lawyeR
3

Fuzzy Matching

Approximate String Matching is approximately matching one string to another. e.g. banana and bananas.
Fuzzy Matching is finding an approximate pattern in a string. e.g. banana within bananas in pyjamas.

Method and R implementation:

  • Basic (Bitap ≈ Levenshtein): b$name <- lapply(b$name, agrep, a$name, value=TRUE); merge(a,b)
  • Advanced (?stringdist::stringdist-metrics): fuzzyjoin::stringdist_join(a, b, mode='full', by=c('name'), method='lv')
  • Fuzzy Match (TRE): agrep2 <- function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))]; b$name <- lapply(b$name, agrep2, a$name); merge(a, b)

Run yourself

# Data
a <- data.frame(name=c('Ace Co.', 'Bayes Inc.', 'asdf'), qty=c(9,99,10))
b <- data.frame(name=c('Ace Company', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price=c(10,13,2,1,15,1))

# Basic
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, agrep, a$name, value=TRUE)
merge(a, c, all.x=TRUE)

# Advanced
fuzzyjoin::stringdist_join(a, b, mode='full')

# Fuzzy Match
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))], a$name)
merge(a, c)
A. West
2

I use lapply for those circumstances:

yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance = 0.01)

Then, to write it out as a csv, it's not so straightforward:

write.csv(matrix(yournewvector, ncol = 1), file = "yournewvector.csv", row.names = FALSE)
1

I agree with the answer above ("Not familiar with agrep, but here's an example using stringdist for your distance matrix"), but adding the signature function below, from Merging Data Sets Based on Partially Matched Data Elements, can make it more accurate, since the Levenshtein calculation is sensitive to position, addition, and deletion.

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}
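To illustrate what the signature buys you: word-order and case differences collapse to the same key, so you can do an exact merge on the signature before any fuzzy matching (a sketch; the `sig` column name is my own, not from the linked post):

signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = '')
}

signature("United Kingdom")  # "kingdomunited"
signature("KINGDOM united")  # also "kingdomunited"

a <- data.frame(name = c('Ace Co', 'Bayes'), price = c(10, 13))
b <- data.frame(name = c('Co Ace', 'Bayes'), qty = c(9, 99))
a$sig <- sapply(a$name, signature)
b$sig <- sapply(b$name, signature)
merge(a, b, by = "sig")  # 'Ace Co' and 'Co Ace' now match exactly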
-1

Here is what I used to get the number of times a company appears in a list even though the company names are inexact matches,

step.1 Install phonics Package

step.2 create a new column called "soundexcodes" in "mylistofcompanynames"

step.3 Use soundex function to return soundex codes of the company names in "soundexcodes"

step.4 Copy the company names AND corresponding soundex code into a new file (2 columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile"

step.5 Remove duplicates of soundexcodes in "companysoundexcodestrainingfile"

step.6 Go through the list of remaining company names and change each name to how you want it to appear against your original companies.

example: "Amazon Inc" A625 can become "Amazon" A625; "Accenture Limited" A455 can become "Accenture" A455.

step.7 Perform a left_join (or a simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes".

step.8 The result should have the original list with a new column called "co.y" which has the name of the company the way you left it in the training file.

step.9 Sort "co.y" and check whether most of the company names are matched correctly; if so, replace the old company names with the new ones given by the vlookup of the soundex code.
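The steps above can be sketched in R roughly as follows (a sketch under assumptions: the phonics package's soundex() works on single alphabetic tokens, so spaces and punctuation are stripped first, the hand-edited training table is inlined, and all column names are illustrative):

library(phonics)
library(dplyr)

mylistofcompanynames <- data.frame(
  co = c("Amazon Inc", "Amazon, Inc.", "Accenture Limited", "Accenture Ltd"),
  stringsAsFactors = FALSE)

# soundex code per company name (letters only)
mylistofcompanynames$soundexcodes <-
  soundex(gsub("[^A-Za-z]", "", mylistofcompanynames$co))

# deduplicated training table; canonical names filled in by hand
training <- data.frame(
  soundexcodes = unique(mylistofcompanynames$soundexcodes),
  co = c("Amazon", "Accenture"),  # hand-edited to the preferred spelling
  stringsAsFactors = FALSE)

# left join on the soundex code; co.y holds the canonical name
left_join(mylistofcompanynames, training,
          by = "soundexcodes", suffix = c("", ".y"))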

marc_s