I have two dataframes. one (txt.df) has a column with a text I want to extract phrases from (text). The other (wrd.df) has a column with the phrases (phrase). both are big dataframes with complex texts and strings but lets say:
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
text = c("they love cats and dogs", "he is drinking juice",
"the child is having a nap on the bed", "they jump on the bed and break it",
"the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
"sleeping on the bed"))
what I finally need is a txt.df with another column which contains labels of the phrases detected.
what I tried was creating a column in wrd.df in which I tokenized the phrases like this
wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))
and then tried to write a custom function to sapply over the tokens column with grepl/str_detect get the names (labels) of those which were all true
Extract.Fun <- function(text, df, label, token){
for (i in token) {
truefalse[i] <- sapply(token[i], function (x) grepl(x, text))
truenames[i] <- names(which(truefalse[i] == T))
removedup[i] <- unique(truenames[i])
return(removedup)
}
and then sapply this custom function on my txt.df$text to have a new column with the labels.
txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))
but I'm not good with custom functions and am really stuck. I would appreciate any help. P.S. It would be very good if i could also have partial matches like "drink juice" and "broke the bed"... but it's not a priority... fine with the original ones.