Closest match of a sentence between two data frames in R

Question

I have two data frames. The first one is saved in an object named b:

structure(list(CONTENT = c("@myntra beautiful teamä»ç where is the winners list?", 
"The best ever Puma wishlist for Workout freaks, Head over to @myntra  https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good", 
"I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym.  https://t.co/VeRy4G3c7X https://t.co/fOpBRWCdSh", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym.  https://t.co/VeRy4G3c7X.....", 
"@DrDrupad @myntra #myPUMAcollection superb :)", "Super exclusive collection @myntra #myPUMAcollection   https://t.co/Qm9dZzJdms", 
"@myntra gave my best Love playing wid u Hope to win  #myPUMAcollection", 
"Check out PUMA Unisex Black Running Performance Gloves on Myntra!  https://t.co/YD6IcvuG98 @myntra  #myPUMAcollection", 
"@myntra i have been mailing my issue daily since past week.All i get in reply is an auto generated assurance mail. 1st time pissed wd myntra"
), score = c(7.129, 7.08, 6.676, 5.572, 5.572, 5.535, 5.424, 
5.205, 4.464, 4.245)), .Names = c("CONTENT", "score"), row.names = c(25L, 
103L, 95L, 66L, 90L, 75L, 107L, 32L, 184L, 2L), class = "data.frame")

The second data frame is saved in an object named c:

structure(list(CONTENT = c("The best ever for workout  over to myntra like if you find it good", 
"i finalised buy a top  myntra and found the at in feel like i so in life"
)), .Names = "CONTENT", row.names = c(103L, 95L), class = "data.frame")

For each statement in the second data frame (c), I want to find the closest match in the first data frame (b) and return the corresponding score from b.

For example, the statement "The best ever for workout over to myntra like if you find it good" matches the second statement in b most closely, so I should get the score 7.080.

I tried adapting some code from Stack Overflow:

library(stringr)     # provides str_split()
library(data.table)
cp <- str_split(c$CONTENT, " ")
nn <- lengths(cp)  ## Or, for < R-3.2.0, `nn <- sapply(cp, length)`
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(cp), key="grp")
dt[,Score:=b$score[pmatch(X,b$CONTENT)]]
dt[!is.na(Score), list(avgScore=sum(Score)), by="grp"]

This returns the value for only one statement from df c. Can someone help?

Are you committed to this approach of `str_split` / `pmatch` for determining the best match for a given phrase? Because there are proper fuzzy matching algorithms for situations like this that may produce better results. — nrussell, Mar 05 '16 at 14:15
@nrussell not really...would be helpful if you can let me know the kind of fuzzy matching algorithms that can be deployed — LeArNr, Mar 05 '16 at 16:03

score 2 · Accepted Answer · answered Mar 05 '16 at 16:22
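
As background, the attempt in the question returns a value for only one statement because pmatch does case-sensitive prefix matching of each single word against the whole strings in b$CONTENT, so only a word such as "The" that happens to start one of those strings gets a score. A minimal base R sketch of that behaviour, using shortened, made-up stand-ins for the question's data:

```r
# Hypothetical, shortened stand-ins for b$CONTENT and for the words split from c$CONTENT.
contents <- c("The best ever Puma wishlist", "I finalised on buy a top from Myntra")
words <- c("The", "best", "i", "finalised", "myntra")

# pmatch() succeeds only when a word is a case-sensitive prefix of a whole string,
# so every word except "The" comes back as NA.
hits <- pmatch(words, contents)
hits
```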

Here's one approach using stringsim from the stringdist package. There are several methods (algorithms) to choose from; I settled on the Jaro similarity (method = "jw", which is plain Jaro as long as the prefix factor p is left at its default of 0) because it seemed to produce reasonable results for your data. Having said that, my experience with this subject is casual at best, so you may want to spend some time reading up on, and experimenting with, the various algorithms provided by stringdist.
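
If you want to experiment, a rough sketch like the following puts a few of the stringsim methods side by side. The strings are shortened, made-up variants of your data, and the choice of methods is only an illustration, not a recommendation:

```r
library(stringdist)

# Hypothetical strings, shortened from the question's data.
target <- "the best ever for workout over to myntra"
candidates <- c(
  "The best ever Puma wishlist for Workout freaks, Head over to @myntra",
  "Super exclusive collection @myntra #myPUMAcollection"
)

# A handful of the methods accepted by stringsim(); see ?stringdist for the full list.
methods <- c("jw", "lv", "osa", "cosine", "jaccard")

# One row per candidate, one column per method; similarities are scaled to [0, 1].
sims <- sapply(methods, function(m) stringsim(target, candidates, method = m))
rownames(sims) <- c("candidate_1", "candidate_2")
round(sims, 3)
```

Whichever method ranks the correct candidate highest most consistently on a hand-checked sample of your real data is probably the one to go with.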

To reduce clutter, I used this wrapper function to return the index of the most similar (highest similarity value) element for a given string,

library(stringdist)
library(data.table)

best_match <- function(x, y, method = "jw", ...) {
    which.max(stringsim(x, y, method, ...))
}

and made a data.table with the strings to be matched, adding a dummy index for row-wise operations:

# df_b and df_c are the question's b and c, renamed
Dt <- data.table(
    MatchPhrase = df_c$CONTENT,
    Idx = seq_len(nrow(df_c))
)

Using best_match, add a column with the index of the best match (and drop the dummy Idx column afterwards),

Dt[, MatchIdx := best_match(df_b$CONTENT, MatchPhrase), 
    by = "Idx"][,Idx := NULL]

and extract the corresponding elements from df_b (I renamed your data from b and c to df_b and df_c, respectively):

Dt[, .(Score = df_b$score[MatchIdx],
       BestMatch = df_b$CONTENT[MatchIdx]),
   by = "MatchPhrase"]
#                                                                MatchPhrase Score
#1:       The best ever for workout  over to myntra like if you find it good 7.080
#2: i finalised buy a top  myntra and found the at in feel like i so in life 6.676

#                                                                                                                                      BestMatch
#1: The best ever Puma wishlist for Workout freaks, Head over to @myntra  https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good
#2:         I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!
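
One caveat: which.max always returns some index, even when nothing in df_b is a good match. A possible variation, sketched here without data.table, is to compute the full distance matrix with stringdistmatrix and keep the distance of the best match so that weak matches can be discarded. The small data frames and the cutoff of 0.4 below are made up for illustration, and the cutoff in particular would need tuning on real data:

```r
library(stringdist)

# Hypothetical, shortened stand-ins for df_b and df_c.
df_b <- data.frame(
  CONTENT = c("Super exclusive collection @myntra",
              "The best ever Puma wishlist for Workout freaks",
              "I finalised on buy a top from Myntra"),
  score = c(5.424, 7.080, 6.676),
  stringsAsFactors = FALSE
)
df_c <- data.frame(
  CONTENT = c("the best ever for workout", "i finalised buy a top myntra"),
  stringsAsFactors = FALSE
)

# Rows correspond to the phrases in df_c, columns to the candidates in df_b.
# With method = "jw" the distance is 1 minus the similarity used above.
d <- stringdistmatrix(df_c$CONTENT, df_b$CONTENT, method = "jw")

# Column index of the closest candidate for each phrase.
idx <- apply(d, 1, which.min)

result <- data.frame(
  MatchPhrase = df_c$CONTENT,
  BestMatch = df_b$CONTENT[idx],
  Score = df_b$score[idx],
  Distance = d[cbind(seq_along(idx), idx)],
  stringsAsFactors = FALSE
)

# Arbitrary placeholder cutoff: blank out scores whose best match is still far away.
cutoff <- 0.4
result$Score[result$Distance > cutoff] <- NA
result
```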

many thanks nrussell....it worked flawlessly with the example set. I will explore more on implementing this with my actual dataset. Thank you again. — LeArNr, Mar 05 '16 at 17:01
@nrussell....I did go through the Jaro Distance....and found quite interesting....thank you for introducing me fuzzy matching algorithms....never knew before...will be very helpful to me. — LeArNr, Mar 05 '16 at 17:29
