I have two files: one is full of keywords (roughly 2,000 rows) and the other is full of text (roughly 770,000 rows). The keyword file looks like:
Event Name             Keyword
All-day tabby fest     tabby, all-day
All-day tabby fest     tabby, fest
Maine Coon Grooming    maine coon, groom
Maine Coon Grooming    coon, groom
library(tibble)
keywordFile <- tibble(
  EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom")
)
The text file looks like:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
text <- tibble(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday")
)
What I want is to iterate through the text file and look for fuzzy matches (a description must include every word in the "Keyword" column) and return a new column that displays TRUE or FALSE. If it is TRUE, then I want a third column to display the event name. So something that looks like:
Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE
I am able to successfully do my fuzzy matches (after converting everything to lowercase) with stuff like this, thanks to Molx (How can I check if multiple strings exist in another string?):
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
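Since each Keyword cell holds several keywords in one comma-separated string, the check above needs the string split first. Here is a minimal sketch of that step; `has_all_keywords` is a hypothetical helper name, and it assumes plain substring matching (`fixed = TRUE`) on lowercased text is what "fuzzy" means here:

```r
# Hypothetical helper: TRUE only if every comma-separated keyword
# occurs somewhere in the description (case-insensitive, literal match).
has_all_keywords <- function(keyword_string, description) {
  # Split "tabby, fest" into c("tabby", "fest") and trim stray spaces
  keywords <- trimws(strsplit(tolower(keyword_string), ",", fixed = TRUE)[[1]])
  # fixed = TRUE treats each keyword as a literal substring, not a regex
  all(vapply(keywords, grepl, logical(1),
             x = tolower(description), fixed = TRUE))
}

has_all_keywords("tabby, fest", "Bring your tabby to the fest on Tuesday")     # TRUE
has_all_keywords("tabby, all-day", "Bring your tabby to the fest on Tuesday")  # FALSE
```

Literal matching also avoids surprises if a keyword ever contains regex metacharacters such as `.` or `(`.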
However, I am getting stuck when I try to fuzzy match the whole files. I tried something like this:
for (i in seq_along(text$Description)) {
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}
I don't think I'm having trouble converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors and my text$Description column is a character string. But I'm struggling with how to iterate properly through both files. The error I'm getting is:
Error in ... replacement has 13 rows, data has 1
Has anyone done anything like this before?
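For reference, one possible direction is sketched below. The loop above assigns the whole keywordFile$EventName column to text$EventName, which is probably where the length mismatch comes from, and text$TF[i] is overwritten on every pass of the inner loop. This sketch instead loops only over the keyword rows and lets grepl work on the entire Description column at once. The assumptions are mine: plain data frames stand in for the tibbles (the same code should behave the same on tibbles), matching is literal and case-insensitive, and the first matching keyword row wins.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Lowercase once instead of inside the loop
desc_lower <- tolower(text$Description)
text$Match <- FALSE
text$EventName <- NA_character_

for (j in seq_len(nrow(keywordFile))) {
  # Split the comma-separated cell into individual keywords
  keywords <- trimws(strsplit(tolower(keywordFile$Keyword[j]), ",", fixed = TRUE)[[1]])
  # One logical vector per keyword over all descriptions, combined with &
  hits <- Reduce(`&`, lapply(keywords, grepl, x = desc_lower, fixed = TRUE))
  # Only fill rows that no earlier keyword row has already matched
  new_hits <- hits & !text$Match
  text$Match[new_hits] <- TRUE
  text$EventName[new_hits] <- keywordFile$EventName[j]
}

text
```

On the sample data this gives Match = TRUE, FALSE, TRUE, FALSE, with the event name filled in only for the two matching rows. With roughly 2,000 keyword rows this makes a few thousand vectorized grepl calls rather than one call per description and keyword pair, though I have not timed it on data of the full size.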