Approximate string matching in R between two datasets

Question

I have one dataset containing film titles and their genres, and another dataset containing plain text in which these titles may or may not be mentioned:

dt1

   title                                        genre
   Secret in Their Eyes                         Dramas
   V for Vendetta                               Action & Adventure
   Bottersnikes & Gumbles                       Kids' TV
   ...                                          ...

and

dt2

id      Text
1       "I really liked V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4       "@thewitcher was an interesting series"
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc

What I want is a function that takes the titles in dt1 and tries to find them in the text in dt2.

If it finds an exact or approximate match, I want a column in dt2 that gives the title mentioned in the text. If more than one title is mentioned, I want all of them, separated by commas.

dt2

id      Text                                                                       mentions
1       "I really liked V for Vendetta"                                            "V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "                            "Bottersnikes & Gumbles"
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."   "Bottersnikes & Gumbles"
4       "@thewitcher was an interesting series"                                    NA
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"        "Secret in Their Eyes, V for Vendetta"
... etc

See https://stackoverflow.com/questions/59722865/how-to-do-fuzzy-pattern-matching-with-quanteda-and-kwic — Ken Benoit, Apr 17 '20 at 11:11
@KenBenoit thank you. However my list of titles has more than 1000 items and I'd need a way to do this process for each of those titles — Carbo, Apr 17 '20 at 11:26

score 4 · Accepted Answer · answered Apr 17 '20 at 14:21

You can do the fuzzy matching with agrepl(), the variant of agrep() that returns a logical vector. Here I call it once per title with lapply(), which gives one logical vector per title marking which Text entries match. I then combine these into a data.frame and use apply() across its rows to build the vector of matched titles for each Text.

You can tweak the max.distance value, but 0.3 worked fine on your example.
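To get a feel for how max.distance changes the result, here is a small sketch that tries several values on one title and one text from the question. Per the agrep() documentation, a value below 1 is read as a fraction of the pattern length and rounded up to a whole number of edits, so short titles tolerate fewer edits than long ones. The distances tried here are arbitrary choices for illustration, not recommended settings.

```r
# One title and one text taken from the question's example data
title <- "V for Vendetta"
text  <- "Secret in Their Eye is a terrible film! but I Like V per Vendetta"

# Arbitrary distances to compare (illustrative values only)
distances <- c(0, 0.1, 0.3)

# For each distance, check whether the title approximately occurs in the text
hits <- vapply(
  distances,
  function(d) agrepl(title, text, max.distance = d, ignore.case = TRUE),
  logical(1)
)
names(hits) <- distances
print(hits)
```

Running the same check on your own titles and texts is a quick way to see where false positives start to appear before you settle on a value.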

dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)

dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)

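# For each text in `target`, return the titles from `titles` that fuzzily
# match it, collapsed into one comma-separated string ("" if none match).
match_titles <- function(target, titles) {
  # One logical vector per title: which texts contain an approximate match?
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # Rows are texts, columns are titles; collect the matching titles per row.
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}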

dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
##   id                                                                  Text
## 1  1                                         I really liked V for Vendetta
## 2  2                         Bottersnikes & Gumbles was a great film .... 
## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4  4                                 @thewitcher was an interesting series
## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
##                                titles
## 1                      V for Vendetta
## 2              Bottersnikes & Gumbles
## 3              Bottersnikes & Gumbles
## 4                                    
## 5 Secret in Their Eyes,V for Vendetta
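Note that a text with no match gets an empty string here, whereas the question asked for NA. A minimal sketch of one way to convert afterwards, shown on a hypothetical stand-in vector shaped like the titles column rather than on a fresh run of the matching:

```r
# Stand-in for dt2$titles (hypothetical values mirroring the output above)
titles <- c(
  "V for Vendetta",
  "",
  "Secret in Their Eyes,V for Vendetta"
)

# Replace empty strings (no title matched) with a missing value
titles[!nzchar(titles)] <- NA_character_
print(titles)
```

On the real data the same assignment would be applied to dt2$titles. If you prefer "Title A, Title B" with a space after the comma, changing collapse = "," to collapse = ", " in match_titles() should do it.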
