
I am trying to merge two data sets on a movie-title column using fuzzy string matching. Samples from the two data sets are given below.

The first data set looks like

  itemid userid rating       time                              title release_date
99995    1677    854      3 1997-12-22                      sweet nothing         1995
99996    1678    863      1 1998-03-07                         mat' i syn         1997
99997    1679    863      3 1998-03-07                          b. monkey         1998
99998    1429    863      2 1998-03-07                      sliding doors         1998
99999    1681    896      3 1998-02-11                       you so crazy         1994
100000   1682    916      3 1997-11-29 scream of stone (schrei aus stein)         1991

The second one is

itemid userid rating       time                     title release_date
117201 3175936   9140      3 2013-09-22 bei tou zou de na wu nian         2013
117202 3175936  17439      3 2013-09-18 bei tou zou de na wu nian         2013
117203 3181128   3024      5 2013-09-13                mac & jack         2013
117204 3181962  17310      5 2013-09-19         the last shepherd         2013
117205 3188690  13551      5 2013-09-17     the making of a queen         2013
117206 3198468   5338      3 2013-09-22          north 24 kaatham         2013

dput of df1:

structure(list(itemid = c(1677L, 1678L, 1679L, 1429L, 1681L, 
1682L), userid = c(854L, 863L, 863L, 863L, 896L, 916L), rating = c(3L, 
1L, 3L, 2L, 3L, 3L), time = structure(c(10217, 10292, 10292, 
10292, 10268, 10194), class = "Date"), title = c("sweet nothing", 
"mat' i syn", "b. monkey", "sliding doors", "you so crazy", "scream of stone (schrei aus stein)"
), release_date = c("1995", "1997", "1998", "1998", "1994", "1991"
)), .Names = c("itemid", "userid", "rating", "time", "title", 
"release_date"), row.names = 99995:100000, class = "data.frame")

dput of df2:

structure(list(itemid = c(3175936L, 3175936L, 3181128L, 3181962L, 
3188690L, 3198468L), userid = c(9140L, 17439L, 3024L, 17310L, 
13551L, 5338L), rating = c(3, 3, 5, 5, 5, 3), time = structure(c(15970, 
15966, 15961, 15967, 15965, 15970), class = "Date"), title = c("bei tou zou de na wu nian", 
"bei tou zou de na wu nian", "mac & jack", "the last shepherd", 
"the making of a queen", "north 24 kaatham"), release_date = c("2013", 
"2013", "2013", "2013", "2013", "2013")), .Names = c("itemid", 
"userid", "rating", "time", "title", "release_date"), row.names = 117201:117206, class = "data.frame")

I want to fuzzy match the titles in the two data sets using levenshteinSim and, for title pairs where the similarity is above some threshold (say 0.85), extract the information for that movie from both data sets into a new data set. At the same time I need to check that the matched titles have the same release date, since movies with exactly the same name can have different release dates.

Can anyone guide me on how to get this task done?

I have tried the below code so far:

library(RecordLinkage)
sim <- sapply(df1$title,levenshteinSim,df2$title)

This gives a matrix with dimensions 11451 x 1682, where each column corresponds to a single movie title from the first data frame and the rows contain the similarity values. I could loop over it, or perhaps use melt and dcast, to pull out the columns with max(similarity) > 0.85, but this does not look like an efficient way. Plus, I am not able to match the release date in this code.
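
For reference, one way to pull the above-threshold pairs out of that matrix is which(..., arr.ind = TRUE); a minimal sketch (the 0.85 cut-off is illustrative, and it still does not handle the release-date requirement):

hits <- which(sim > 0.85, arr.ind = TRUE)                     # row = index into df2, col = index into df1
matches <- data.frame(df1[hits[,'col'],], df2[hits[,'row'],]) # side-by-side rows for each matching pair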

  • could you add the output of `dput(df)` for your data sets to the question? – Marat Talipov Jan 29 '15 at 16:13
  • Done. Although there are no matching movies in these samples, the real data set does have matches. For example, if the Levenshtein similarity is > 0.80 and the release dates are the same, extract the info into a new df. – syebill Jan 29 '15 at 16:27

1 Answer


You can merge these data frames

z <- merge(df1,df2,by='release_date',suffixes=c('.df1','.df2'))

which will give you a cartesian product (i.e. all possible combinations of df1 and df2 rows with the same release_date, so matched pairs automatically share a release date), and then calculate the Levenshtein similarity:

z$L.dist <- levenshteinSim(z$title.df1,z$title.df2)

Having z$L.dist, you can filter the desired rows:

subset(z,L.dist > 0.85)
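
Put together (assuming levenshteinSim comes from the RecordLinkage package mentioned later in the comments), a minimal end-to-end sketch of this approach; the result keeps every column from both inputs with the .df1/.df2 suffixes, plus the shared release_date and the score:

library(RecordLinkage)

z <- merge(df1,df2,by='release_date',suffixes=c('.df1','.df2'))  # all row pairs sharing a release_date
z$L.dist <- levenshteinSim(z$title.df1,z$title.df2)              # similarity in [0,1] for each title pair
matched <- subset(z,L.dist > 0.85)                               # keep only sufficiently similar titles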

Update

Here is a similar approach using data.table, which might be a faster alternative:

library(data.table)
d1 <- as.data.table(df1)
d2 <- as.data.table(df2)
setkey(d1,release_date)
setkey(d2,release_date)

z <- d1[d2,allow.cartesian=TRUE,nomatch=0]

# levenshteinSim() may return a matrix rather than a vector here, so apply it pairwise:
#   z[,L.dist:=levenshteinSim(title,i.title)]
z[,L.dist:=mapply(levenshteinSim,title,i.title)]


z[L.dist > 0.8]
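
If memory becomes the bottleneck (the comments below hit an allocation error with ~100K-row inputs), one option discussed there is to carry only the columns you actually need into the cartesian join; a sketch, assuming title, rating and release_date are enough downstream:

library(data.table)

d1 <- as.data.table(df1[,c('title','rating','release_date')])  # drop unused columns before joining
d2 <- as.data.table(df2[,c('title','rating','release_date')])
# ...then setkey(), join and compute L.dist exactly as above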
  • Your approach is totally different and very good. I wanted to ask if this can be applied to data sets with more than 100K rows each? – syebill Jan 29 '15 at 17:37
  • I was going to ask you about that! :) I guess the algorithm by itself should be quite effective, so I'd suggest to give it a try. – Marat Talipov Jan 29 '15 at 17:41
  • I tried it and came up with this error "Error: cannot allocate vector of size 266.4 Mb In addition: Warning messages: 1: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, seq_len(ncx)[-by.x]), : Reached total allocation of 5942Mb: see help(memory.size) 2: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, seq_len(ncx)[-by.x]), : Reached total allocation of 5942Mb: see help(memory.size) 3: In `[.data.frame`(x, c(m$xi, if (all.x) m$x.alone), c(by.x, seq_len(ncx)[-by.x]), : Reached total allocation of 5942Mb: see help(memory.size) – syebill Jan 29 '15 at 17:53
  • Can I reduce the dimensions of the data sets? I do not need all the columns in the resulting data set, only the movies and ratings, and possibly the release date, although that is not necessary. – syebill Jan 29 '15 at 17:58
  • Yeah, that's a great idea. Try something like `df1.fc <- df[,c('release_date','rating','title')]` – Marat Talipov Jan 29 '15 at 18:04
  • You can also try a `data.table` solution (see in the post) – Marat Talipov Jan 29 '15 at 18:31
  • Your solution using data.table sped up the process, but R still crashes when it evaluates the Levenshtein similarity on this statement: z[,L.dist:=levenshteinSim(title,i.title)]. Could it be something to do with my machine or memory? – syebill Jan 30 '15 at 00:59
  • I don't know exactly the output of the `levenshteinSim` function (BTW, what package is it from?), and I suspect it might produce a matrix instead of a vector, which could cause an error for the `data.table` solution. I've modified the `data.table` solution to account for this assumption; could you try it and see if it works now? – Marat Talipov Jan 30 '15 at 03:11
  • levenshteinSim is from a package called "RecordLinkage" and I believe it produces a matrix if multiple matches are made. – syebill Jan 30 '15 at 13:31
  • @MaratTalipov, I was looking for some suggestions on how we can get the data for some specific equivalent column of the match from the 2nd dataset. I have posted the question at http://stackoverflow.com/questions/42749447/r-fuzzy-string-match-to-return-specific-column-based-on-matched-string. It would be a great help if you can suggest something. – user1412 Mar 12 '17 at 17:14