There are a few points that I did not quite understand in your question; however, I am proposing a solution for your question. Check whether it will work for you.
I assume that you want to find the best-matching labels for the weather texts. If so, you can use stringsim
function from library(stringdist)
in the following way.
First Note: If you clean the \n
in your data, the result will be more accurate. So, I clean them for this example, but if you want you can keep them.
Second Note: You can change the similarity distance based on the different methods. Here I used cosine similarity, which is a relatively good starting point. If you want to see the alternative methods, please see the reference of the function:
?stringsim
The clean data is as follow:
df.word <- data.frame(
label = c("warm wet", "warm dry", "cold wet"),
synonym = c(
"hot and drizzling sunny and raining",
"sunny and clear sky dry sunny day",
"cold winds and raining snowing"
)
)
df.string <- data.frame(
day = c(1, 2, 3, 4),
weather = c(
"there would be some drizzling at dawn but we will have a hot day",
"today there are cold winds and a bit of raining or snowing at night",
"a sunny and clear sky is what we have today",
"a warm dry day"
)
)
Install the library and load it
install.packages('stringdist')
library(stringdist)
Create a n x m
matrix that contains the similarity scores for each whether text with each synonym. The rows show each whether text and the columns represent each synonym group.
match.scores <- sapply( ## Create a nested loop with sapply
seq_along(df.word$synonym), ## Loop for each synonym as 'i'
function(i) {
sapply(
seq_along(df.string$weather), ## Loop for each weather as 'j'
function(j) {
stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity
method = "cosine", ## Method cosine
q = 2 ## Size of the q -gram: 2
)
}
)
}
)
r$> match.scores
[,1] [,2] [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
Get the best matches across the rows for each whether text, find the labels with the highest matching scores, and add these labels to the data frame.
ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]
df.string
r$> df.string
day weather extract
1 1 there would be some drizzling at dawn but we will have a hot day warm wet
2 2 today there are cold winds and a bit of raining or snowing at night cold wet
3 3 a sunny and clear sky is what we have today warm dry
4 4 a warm dry day warm dry