I am trying to join two data frames by matching strings on the number of words they share. For example, I only want to match strings that share 2 or more words (e.g. "alpha bravo" shares two words with "charlie bravo alpha"). It is a kind of fuzzy matching based on shared words rather than shared characters. I have provided two data frames below: one with the original data and another that it should be matched against. The last data frame shows roughly what I would like the output to look like: the two data frames are matched and joined, and matchingWords is the number of words shared between letters and newLetters. Note that if one string from originalData matches two different strings from matchData, a row is added for each match.
Any help here would be greatly appreciated.
# Reproducible data
originalData <- data.frame(letters = c("foxtrot alpha echo", "lima golf", "kilo bravo hotel", "whiskey quebec november", "india echo charlie alpha"))
matchData <- data.frame(newLetters = c("romeo golf lima", "tango charlie bravo", "alpha echo whiskey", "hotel", "quebec golf foxtrot", "echo november bravo", "charlie alpha", "india november whiskey"),
                        numbers = 1:8,
                        yesNo = rep(c("yes", "no"), 4))
# I would like to get an end product similar to this
# (row 7 of matchData, "charlie alpha", has yesNo = "yes")
desiredDataframe <- data.frame(letters = c("foxtrot alpha echo", "lima golf", "kilo bravo hotel", "whiskey quebec november", "india echo charlie alpha", "india echo charlie alpha"),
                               newLetters = c("alpha echo whiskey", "romeo golf lima", NA, "india november whiskey", "alpha echo whiskey", "charlie alpha"),
                               matchingWords = c(2, 2, NA, 2, 2, 2),
                               numbers = c(3, 1, NA, 8, 3, 7),
                               yesNo = c("yes", "yes", NA, "no", "yes", "yes"))
I have tried a few ways of counting the matching words for two specific strings, such as:
library(stringr)
length(intersect(str_split("foxtrot hotel bravo", " ")[[1]], str_split("bravo golf hotel", " ")[[1]]))
or
sum(str_split("foxtrot hotel bravo", " ")[[1]] %in% str_split("bravo golf hotel", " ")[[1]])
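For reference, the pairwise count can be wrapped in a small helper. This is only a sketch: `countSharedWords` is a hypothetical name, it uses base R `strsplit()` instead of `str_split()`, and it assumes that words are separated by single spaces and that a repeated word should be counted once.

```r
# Hypothetical helper: number of distinct words two strings have in common.
# Assumes words are separated by single spaces.
countSharedWords <- function(a, b) {
  wordsA <- strsplit(a, " ", fixed = TRUE)[[1]]
  wordsB <- strsplit(b, " ", fixed = TRUE)[[1]]
  length(intersect(wordsA, wordsB))
}

countSharedWords("foxtrot hotel bravo", "bravo golf hotel")
# [1] 2
```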
However, I have been unsuccessful in fuzzy matching these strings by the number of words they share. Regular fuzzy matching (by characters or similar methods) does not work when entire words are missing or different. Perhaps str_match() or str_extract() might be of use here.
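One brute-force direction, sketched below in base R, might be to build every pair of rows, count the shared words per pair, keep the pairs with at least two, and then left join back so unmatched rows survive as NA. The sketch makes several assumptions that are not part of the data above: `minShared`, `pairs`, `matched` and `result` are illustrative names, words are separated by single spaces, and the values in letters are unique (duplicates would multiply rows in the final merge).

```r
# Sample data copied from the question; stringsAsFactors = FALSE keeps the
# text columns as character vectors on older R versions too.
originalData <- data.frame(
  letters = c("foxtrot alpha echo", "lima golf", "kilo bravo hotel",
              "whiskey quebec november", "india echo charlie alpha"),
  stringsAsFactors = FALSE
)
matchData <- data.frame(
  newLetters = c("romeo golf lima", "tango charlie bravo", "alpha echo whiskey",
                 "hotel", "quebec golf foxtrot", "echo november bravo",
                 "charlie alpha", "india november whiskey"),
  numbers = 1:8,
  yesNo = rep(c("yes", "no"), 4),
  stringsAsFactors = FALSE
)

minShared <- 2  # assumed threshold: keep pairs sharing at least this many words

# Cross join: merge() with by = NULL returns every combination of rows.
pairs <- merge(originalData, matchData, by = NULL)

# Split each string into words once, then count distinct shared words per pair.
wordsA <- strsplit(pairs$letters, " ", fixed = TRUE)
wordsB <- strsplit(pairs$newLetters, " ", fixed = TRUE)
pairs$matchingWords <- mapply(function(a, b) length(intersect(a, b)),
                              wordsA, wordsB)

# Keep only the pairs that reach the threshold.
matched <- pairs[pairs$matchingWords >= minShared, ]

# Left join back onto the original data so unmatched strings get NA.
result <- merge(originalData, matched, by = "letters", all.x = TRUE)

# Restore the original row order and the desired column order.
result <- result[order(match(result$letters, originalData$letters)),
                 c("letters", "newLetters", "matchingWords", "numbers", "yesNo")]
rownames(result) <- NULL
result
```

With the sample data this gives six rows, with "kilo bravo hotel" kept as a single NA row. Because every pair is compared, the work grows with nrow(originalData) * nrow(matchData), so this may not suit large tables.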