I am trying to get the nearest matching string, along with its score, using the "stringdist" package with method = "jw" (Jaro-Winkler).

The first data frame (df_1) has 2 columns. For each value of str_1, I want the nearest string from str_2 in df_2 and the score for that match.
I have gone through the package and found a possible approach, which I show below:

    year = c(2001,2001,2002,2003,2005,2006)
    str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
             "Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
    
    df_1 = data.frame(year,str_1)
    
    ID = c(100,211,155,367,678,2356,927,829,397)
    str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
              "Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
              "fOpBRWCdSh","")

    df_2 = data.frame(ID,str_2)

I need the nearest match from the str_2 column of df_2. Using the call below, the final table should look like this:

    stringdist(a, b, method = "jw")

    df_1$Nearest_matching = c("The best Puma wishlist","I finalised on buy a top from jobong","Its a day at gym","Check out PUMA Unisex Black Running","i have been mailing my issue twice",NA) 
    df_1$Nearest_matching_score = c(0.099,0.092,0.205,0,0.078,NA)
user438383
san1

2 Answers

Here is a way to find the closest match and score for each value in df_1$str_1.

    library(dplyr)
    library(purrr)
    library(stringdist)

    result <- bind_cols(df_1, map_df(df_1$str_1, function(x) {
      # Distance from this string to every candidate in df_2$str_2
      vals <- stringdist(x, df_2$str_2, method = 'jw')
      # Keep the closest candidate and its (minimum) distance
      data.frame(Nearest_matching = df_2$str_2[which.min(vals)],
                 Nearest_matching_score = min(vals))
    }))

    #  year                                str_1
    #1 2001          The best ever Puma wishlist
    #2 2001 I finalised on buy a top from Myntra
    #3 2002         Its perfect for a day at gym
    #4 2003  Check out PUMA Unisex Black Running
    #5 2005   i have been mailing my issue daily
    #6 2006                                  xyz

    #                      Nearest_matching Nearest_matching_score
    #1               The best Puma wishlist             0.09960718
    #2 I finalised on buy a top from jobong             0.09259259
    #3                     Its a day at gym             0.20535714
    #4  Check out PUMA Unisex Black Running             0.00000000
    #5   i have been mailing my issue twice             0.07843137
    #6                           VeRy4G3c7X             0.52222222
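
The expected output in the question has NA for "xyz", whereas always taking the closest candidate returns some match for every row. Below is a minimal sketch of one way to get NA for poor matches, using `amatch()` from stringdist with its `maxDist` argument. The cutoff of 0.3 and the names `max_dist`, `idx`, `nearest` and `score` are assumptions for illustration and do not come from the question. The data is a small subset of the question's strings, repeated so the block runs on its own.

```r
library(stringdist)

# Subset of the question's data, repeated so the sketch runs on its own
str_1 <- c("The best ever Puma wishlist", "xyz")
str_2 <- c("VeRy4G3c7X", "The best Puma wishlist", "Its a day at gym")

max_dist <- 0.3  # hypothetical cutoff, not from the question; tune for your data

# Index of the closest str_2 entry, or NA when nothing is within max_dist
idx <- amatch(str_1, str_2, method = "jw", maxDist = max_dist)

nearest <- str_2[idx]
# stringdist() compares element by element and returns NA where nearest is NA
score <- stringdist(str_1, nearest, method = "jw")

result <- data.frame(str_1,
                     Nearest_matching = nearest,
                     Nearest_matching_score = score)
print(result)
```

With this cutoff, "xyz" should get NA in both new columns, because its closest candidate is further away than 0.3. Note that with the default `p = 0`, `method = "jw"` returns the plain Jaro distance; the Winkler prefix adjustment only applies when `p` is set above 0.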
Ronak Shah

Here is what I came up with, based on the documentation of the stringdist package:

First I created a distance matrix between str_1 and str_2, then I assigned column names to it like this:

    # Rows correspond to df_1$str_1, columns to df_2$str_2
    nearest_matching <- stringdistmatrix(df_1$str_1, df_2$str_2, method = "jw")
    colnames(nearest_matching) <- df_2$str_2

Then I selected the smallest value (distance) from each row.

    apply(nearest_matching, 1, FUN = min)

output:

    > apply(nearest_matching, 1, FUN = min)
    [1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222

Finally, I wrote out the column names corresponding to these values:

    colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]

output:

    > colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
    [1] "The best Puma wishlist"               "I finalised on buy a top from jobong" "Its a day at gym"
    [4] "Check out PUMA Unisex Black Running"  "i have been mailing my issue twice"   "VeRy4G3c7X"
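
The steps above produce the scores and the matched strings separately. As a rough sketch, they could be combined into one table like the one the question asks for. The names `d`, `best` and `result` are assumptions for illustration, and the data is a small subset of the question's strings so that the block runs on its own.

```r
library(stringdist)

# Subset of the question's data, repeated so the sketch runs on its own
str_1 <- c("The best ever Puma wishlist", "Its perfect for a day at gym")
str_2 <- c("VeRy4G3c7X", "The best Puma wishlist", "Its a day at gym")

# Distance matrix: one row per str_1 value, one column per str_2 value
d <- stringdistmatrix(str_1, str_2, method = "jw")

# Column index of the smallest distance in each row
best <- apply(d, 1, which.min)

result <- data.frame(
  str_1,
  Nearest_matching = str_2[best],
  # Pick d[i, best[i]] for every row i with a two-column index matrix
  Nearest_matching_score = d[cbind(seq_along(best), best)]
)
print(result)
```

Indexing the matrix with `cbind(row, column)` reads the minimum straight from the position found by `which.min`, so the matched string and its score always come from the same cell.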
KacZdr