I am trying to merge two data sets based on movie title column that contains movie names using fuzzy string matching. A sample from the 2 data sets are given below.
The first data set looks like
itemid userid rating time title release_date
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1429 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
The second one is
itemid userid rating time title release_date
117201 3175936 9140 3 2013-09-22 bei tou zou de na wu nian 2013
117202 3175936 17439 3 2013-09-18 bei tou zou de na wu nian 2013
117203 3181128 3024 5 2013-09-13 mac & jack 2013
117204 3181962 17310 5 2013-09-19 the last shepherd 2013
117205 3188690 13551 5 2013-09-17 the making of a queen 2013
117206 3198468 5338 3 2013-09-22 north 24 kaatham 2013
dput - df1
structure(list(itemid = c(1677L, 1678L, 1679L, 1429L, 1681L,
1682L), userid = c(854L, 863L, 863L, 863L, 896L, 916L), rating = c(3L,
1L, 3L, 2L, 3L, 3L), time = structure(c(10217, 10292, 10292,
10292, 10268, 10194), class = "Date"), title = c("sweet nothing",
"mat' i syn", "b. monkey", "sliding doors", "you so crazy", "scream of stone (schrei aus stein)"
), release_date = c("1995", "1997", "1998", "1998", "1994", "1991"
)), .Names = c("itemid", "userid", "rating", "time", "title",
"release_date"), row.names = 99995:100000, class = "data.frame")
dput - df2
structure(list(itemid = c(3175936L, 3175936L, 3181128L, 3181962L,
3188690L, 3198468L), userid = c(9140L, 17439L, 3024L, 17310L,
13551L, 5338L), rating = c(3, 3, 5, 5, 5, 3), time = structure(c(15970,
15966, 15961, 15967, 15965, 15970), class = "Date"), title = c("bei tou zou de na wu nian",
"bei tou zou de na wu nian", "mac & jack", "the last shepherd",
"the making of a queen", "north 24 kaatham"), release_date = c("2013",
"2013", "2013", "2013", "2013", "2013")), .Names = c("itemid",
"userid", "rating", "time", "title", "release_date"), row.names = 117201:117206, class = "data.frame")
I want to fuzzy match the titles in the two data sets using levenshteinSim and for the titles where the similarity is more than 0.85 for example, extract the information for that movie from both data sets into a new data set. At the same time I need to check if the matched titles have the same release date as movies with exactly the same name can have more than one release date.
Can anyone guide me on how to get this task done?
I have tried the below code so far:
df <- sapply(df1$title,lenvenshteinSim,df2$title)
This gives a matrix with dimensions 11451 X 1682. where each column is a single movie title from the 1st data frame and rows contain the similarity value. I can potentially put a loop in here or may be look at melt & dcast to pull out the columns with max(similary) >0.85 but this does not look like an efficient way. Plus, I am not able to match the release date in this code.