How to use R, fuzzy logic and/or stringr to match shared words in strings among dataframes

Question

I am attempting to join two dataframes by matching strings on the number of words they share. For example, I would only like to match strings that share 2 or more words (e.g. "alpha bravo" shares two words with "charlie bravo alpha"). It is a sort of fuzzy matching based on shared words rather than shared characters. I have provided two dataframes: one with the original data and another that it should be matched to. The last dataframe shows what I would like my output to look like: the two dataframes are matched and joined, and matchingWords is the number of words shared between letters and newLetters. Note that if one string from originalData matches two different strings from matchData, the output gets an extra row so that both matches are included.

Any help here would be greatly appreciated.

#Reproducible data 
originalData <- data.frame(letters = c("foxtrot alpha echo", "lima golf", "kilo bravo hotel","whiskey quebec november", "india echo charlie alpha"))
matchData <- data.frame(newLetters = c("romeo golf lima", "tango charlie bravo","alpha echo whiskey", "hotel"  , "quebec golf foxtrot", "echo november bravo", "charlie alpha", "india november whiskey"),
                        numbers = 1:8,
                        yesNo = rep(c("yes", "no"), 4))

#I would like to get an end product similar to this 
desiredDataframe <- data.frame(letters = c("foxtrot alpha echo", "lima golf", "kilo bravo hotel","whiskey quebec november", "india echo charlie alpha","india echo charlie alpha"),
                               newLetters = c("alpha echo whiskey", "romeo golf lima", NA, "india november whiskey", "alpha echo whiskey", "charlie alpha"),
                               matchingWords = c(2, 2, NA, 2, 2, 2),
                               numbers = c(3, 1, NA, 8, 3, 7),
                               yesNo = c("yes", "yes", NA, "no", "yes", "no"))

I have tested a few ways of getting the number of matching words for two specific strings, such as:

library(stringr)

length(intersect(str_split("foxtrot hotel bravo", " ")[[1]], str_split("bravo golf hotel", " ")[[1]]))

or

sum(str_split("foxtrot hotel bravo", " ")[[1]] %in% str_split("bravo golf hotel", " ")[[1]])

But I have been unsuccessful in fuzzy matching these strings by the number of matching words. Regular fuzzy matching (by characters or similar methods) does not work when entire words are missing or different. Perhaps str_match() or str_extract() might be of use here.
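One way the pairwise count above might be extended to every combination of rows is sketched here in base R. The function names count_shared_words and match_by_shared_words and the min_words argument are hypothetical, and the sketch assumes words are separated by spaces and that the strings in newLetters are unique. It has only been reasoned through against the sample data, so treat it as a starting point rather than a finished solution.

```r
# Hypothetical helper: number of distinct words two strings share.
count_shared_words <- function(a, b) {
  length(intersect(strsplit(a, " +")[[1]], strsplit(b, " +")[[1]]))
}

# Hypothetical helper: keep every pair sharing at least min_words words,
# then left-join back so unmatched rows of `original` are kept with NA.
# Assumes `original` has a column `letters` and `match` a column `newLetters`.
match_by_shared_words <- function(original, match, min_words = 2) {
  # All (letters, newLetters) combinations
  pairs <- expand.grid(letters = as.character(original$letters),
                       newLetters = as.character(match$newLetters),
                       stringsAsFactors = FALSE)

  # Shared-word count for each combination
  pairs$matchingWords <- vapply(
    seq_len(nrow(pairs)),
    function(k) count_shared_words(pairs$letters[k], pairs$newLetters[k]),
    integer(1)
  )

  # Drop combinations below the threshold
  pairs <- pairs[pairs$matchingWords >= min_words, ]

  # Attach the remaining columns of `match` (numbers, yesNo, ...)
  pairs <- merge(pairs, match, by = "newLetters")

  # Left join: one row per match, NA where nothing matched
  merge(original, pairs, by = "letters", all.x = TRUE)
}
```

Calling match_by_shared_words(originalData, matchData) on the data above should give one row per qualifying pair plus an NA row for "kilo bravo hotel". Note that merge() sorts by the key, so the row order may differ from desiredDataframe. Because every combination is compared, the cost grows with nrow(originalData) * nrow(matchData), which may be too slow for large tables.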

maybe this can be useful? https://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets — stats_noob, Jul 10 '21 at 06:15
@stats555 That post looks at Levenshtein distance, which counts insertions, deletions and replacements. As stated in the question, this method will not work because a single extra word may add a distance of something like 6 (which would not count as similar under Levenshtein distance). — aczich, Jul 10 '21 at 13:16
