Keeping the best string matched by fuzzy matching in R

Question

I have two data frames in R. One (df.word) holds the phrases I want to match, with their synonyms in a second column; the other (df.string) holds the strings I want to match against, along with codes. The real strings are complicated, but to keep it simple, say we have:

df.word <- data.frame(label = c('warm wet', 'warm dry', 'cold wet'),
                 synonym = c('hot and drizzling\nsunny and raining','sunny and clear sky\ndry sunny day', 'cold winds and raining\nsnowing'))

df.string <- data.frame(day = c(1,2,3,4),
                       weather = c('there would be some drizzling at dawn but we will have a hot day', 'today there are cold winds and a bit of raining or snowing at night', 'a sunny and clear sky is what we have today', 'a warm dry day'))

I want to create df.string$extract, which should hold the best available match for each string.

That is, a column like this:

df.string$extract <- c('warm wet', 'cold wet', 'warm dry', 'warm dry')

Thanks in advance to anyone helping.

Actually, the best match would be the longest match: the one with the most words detected in the string. — ayeh, Jun 22 '20 at 17:41

mustafaakben · Answer 1 · 2020-06-22T15:57:12.557

A few points in your question were not quite clear to me, but here is a proposed solution. Check whether it works for you.

I assume that you want to find the best-matching label for each weather text. If so, you can use the stringsim function from the stringdist package in the following way.

First note: If you remove the \n characters from your data, the results will be more accurate. I removed them for this example, but you can keep them if you want.
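The clean-up described in the first note can be done in base R. This is only a minimal sketch on a copy of the question's synonym column; the variable names synonym and synonym.clean are illustrative, not part of the original code.

```r
# Synonyms as given in the question, with "\n" separating the alternatives
synonym <- c("hot and drizzling\nsunny and raining",
             "sunny and clear sky\ndry sunny day",
             "cold winds and raining\nsnowing")

# Replace every newline with a single space;
# fixed = TRUE matches the newline literally instead of as a regex
synonym.clean <- gsub("\n", " ", synonym, fixed = TRUE)
synonym.clean
```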

Second note: You can change the similarity measure by choosing a different method. Here I used cosine similarity, which is a relatively good starting point. To see the alternative methods, check the function's help page:

?stringsim
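To get a feel for how much the method matters, one option is to score a single pair of strings with several of the methods listed on that help page. This is a rough sketch, not a recommendation of any particular method; the strings a and b are taken from the example data, and the choice of methods is arbitrary.

```r
library(stringdist)

a <- "sunny and clear sky dry sunny day"            # one synonym group
b <- "a sunny and clear sky is what we have today"  # one weather text

# Score the same pair with several methods; q is passed to all of them
# but only matters for the q-gram based ones (qgram, cosine, jaccard)
methods <- c("cosine", "jaccard", "qgram", "jw", "lv")
scores <- sapply(methods, function(m) stringsim(a, b, method = m, q = 2))
round(scores, 3)
```

All scores are on a 0 to 1 scale, but they are not directly comparable across methods, so it is the ranking of candidates within one method that matters.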

The cleaned data is as follows:

df.word <- data.frame(
    label = c("warm wet", "warm dry", "cold wet"),
    synonym = c(
        "hot and drizzling sunny and raining",
        "sunny and clear sky dry sunny day", 
        "cold winds and raining snowing"
    )
)

df.string <- data.frame(
    day = c(1, 2, 3, 4),
    weather = c(
        "there would be some drizzling at dawn but we will have a hot day",
        "today there are cold winds and a bit of raining or snowing at night", 
        "a sunny and clear sky is what we have today", 
        "a warm dry day"
    )
)

Install the package and load it:

install.packages('stringdist')
library(stringdist)

Create an n x m matrix that contains the similarity score of each weather text against each synonym group. The rows correspond to the weather texts and the columns to the synonym groups.

match.scores <- sapply(          ## Create a nested loop with sapply
    seq_along(df.word$synonym),  ## Loop for each synonym as 'i'
    function(i) {
        sapply(
            seq_along(df.string$weather), ## Loop for each weather as 'j'
            function(j) {
                stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity 
                    method = "cosine", ## Method cosine  
                    q = 2 ## Size of the q-gram: 2
                )
            }
        )
    }
)

r$> match.scores
          [,1]      [,2]       [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
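Because stringsim is vectorised over its two string arguments, the same matrix can probably be built more compactly with base R's outer. This is a sketch of an alternative to the nested sapply, not a change to the method; the vectors weather and synonym and the name match.scores2 are stand-ins for the data frame columns above.

```r
library(stringdist)

weather <- c(
    "there would be some drizzling at dawn but we will have a hot day",
    "today there are cold winds and a bit of raining or snowing at night",
    "a sunny and clear sky is what we have today",
    "a warm dry day"
)
synonym <- c(
    "hot and drizzling sunny and raining",
    "sunny and clear sky dry sunny day",
    "cold winds and raining snowing"
)

# outer() expands both vectors to every (weather, synonym) pair and calls
# the function once; rows are weather texts, columns are synonym groups
match.scores2 <- outer(
    weather, synonym,
    function(w, s) stringsim(s, w, method = "cosine", q = 2)
)
dim(match.scores2)
```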

For each weather text, take the best match across its row, look up the label with the highest matching score, and add these labels to the data frame.

ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]

df.string

r$> df.string
  day                                                             weather  extract
1   1    there would be some drizzling at dawn but we will have a hot day warm wet
2   2 today there are cold winds and a bit of raining or snowing at night cold wet
3   3                         a sunny and clear sky is what we have today warm dry
4   4                                                      a warm dry day warm dry
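If "best" is meant as the asker describes in the comments (the match with the most words detected in the string), a whole-word overlap count may be closer to the goal than a q-gram similarity. The following is a different technique from the stringsim approach above, sketched in base R under some assumptions: each synonym phrase is separated by \n as in the original data, words are compared exactly after lower-casing, and the helpers tokens and overlap are hypothetical names introduced here.

```r
df.word <- data.frame(
    label = c("warm wet", "warm dry", "cold wet"),
    synonym = c("hot and drizzling\nsunny and raining",
                "sunny and clear sky\ndry sunny day",
                "cold winds and raining\nsnowing"),
    stringsAsFactors = FALSE
)
weather <- c(
    "there would be some drizzling at dawn but we will have a hot day",
    "today there are cold winds and a bit of raining or snowing at night",
    "a sunny and clear sky is what we have today",
    "a warm dry day"
)

# Hypothetical helper: split one string into lower-case words
tokens <- function(x) strsplit(tolower(x), "[^[:alnum:]]+")[[1]]

# Hypothetical helper: for one text and one synonym cell, return the largest
# number of words from any single synonym phrase that occur in the text
overlap <- function(text, synonyms) {
    phrases <- strsplit(synonyms, "\n", fixed = TRUE)[[1]]
    max(sapply(phrases, function(p) sum(unique(tokens(p)) %in% tokens(text))))
}

# Rows: weather texts, columns: labels
scores <- sapply(df.word$synonym, function(s) sapply(weather, overlap, synonyms = s))

# Pick the label with the highest word count per row (ties go to the first label)
extract <- unname(df.word$label[apply(scores, 1, which.max)])
extract
# [1] "warm wet" "cold wet" "warm dry" "warm dry"
```

On this toy data the result matches the column requested in the question. Common words such as "and" still count as hits, so for real data a stop-word list or a minimum-overlap threshold would likely be needed, and the tokenising pattern may need adapting for non-English text.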

Thank you for your complete explanation and answer. I tried the code with different values of q. The best was q = 5, but still fewer than half were correct. I want all the words of the labels/synonyms to be present, but this does not happen with any method or q (for example, I even get a "warm dry" label for a "snowing" weather). — ayeh, Jun 22 '20 at 16:12
This would have been fine: [link] (https://stackoverflow.com/questions/62411327/how-to-extract-all-matching-patterns-words-in-a-string-in-a-dataframe-column), but the problem is that in this case I only want the best answer, not all matches. — ayeh, Jun 22 '20 at 16:12
I see. Could you please post an example of the end result, so that I can get a better sense of exactly what you want? What would you like to see in the extracted column? — mustafaakben, Jun 22 '20 at 16:50
I don't mind, except that the text is in Persian. It is actually a description of the temperaments of some herbs in a traditional medical text. For example, I have a herb which is hot in the second degree and wet in the third degree, and what I need is "hot wet"; or a drug which "is balanced with slight hotness and dry in first degree", for which I need the label "balanced hot dry". But with the code above I get the label **balanced hot dry** for a drug whose text is only the short sentence "it's dry". I don't know where the **balanced** and **hot** have come from. — ayeh, Jun 22 '20 at 17:15
