Calculate similarity within a dataframe across specific rows (R)

Question

I have a dataframe that looks something like this:

df <- data.frame(
  "index" = 1:10,
  "title" = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
              "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
              "Dragons Den", "Peaky Blinders"),
  "date" = c("01/01/20", "01/01/20", "01/01/20", "01/01/20", "01/01/20",
             "02/01/20", "02/01/20", "02/01/20", "02/01/20", "02/01/20")
)

The output looks like this:

Index  Title              Date
1      Sherlock           01/01/20
2      Peaky Blinders     01/01/20
3      Eastenders         01/01/20
4      BBC News           01/01/20
5      Antiques Roadshow  01/01/20
6      Eastenders         02/01/20
7      BBC News           02/01/20
8      Casualty           02/01/20
9      Dragons Den        02/01/20
10     Peaky Blinders     02/01/20

I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).

It's probably also worth mentioning that the actual dataframe is much larger, and has 120 titles per day, and spans some 700 days. I need to compare the "titles" of each "date" with the previous "date" and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on...

Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and I'll have to somehow add it into the actual data itself.

Would you please share your expected output? — nikn8, May 01 '20 at 05:32

score 1 · Answer 1 · answered Apr 30 '20 at 17:58

I have come up with this solution. However, I'm not sure how it will behave when the number of records per day differs (e.g. you have 8 titles for day n and 15 titles for day n+1). I assume you would want to normalize with respect to the day with more records. Anyway, here it is:

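# One data frame per date
divide <- split.data.frame(df, as.factor(df$date))
similarity <- vector()
# seq_len() avoids the 1:0 trap when there is only one date
for (i in seq_len(length(divide) - 1)) {
  current <- divide[[i]]$title
  following <- divide[[i + 1]]$title
  # Shared titles, normalised by the day with more records
  index <- sum(current %in% following) / max(length(current), length(following))
  similarity <- c(similarity, index)
}
similarity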
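
To see how the choice of denominator matters when two days have different numbers of titles, here is a small sketch. The vectors `day_a` and `day_b` are made up for illustration (they are not from the question's data), and no title is assumed to repeat within a day. The Jaccard index is not part of the solution above; it is shown only as one common alternative.

```r
# Hypothetical title vectors of different lengths (illustration only)
day_a <- c("Sherlock", "Eastenders", "BBC News")
day_b <- c("Eastenders", "BBC News", "Casualty", "Dragons Den", "Peaky Blinders")

# Number of titles present on both days
shared <- length(intersect(day_a, day_b))

# Normalise by the longer day, as in the loop above
sim_max <- shared / max(length(day_a), length(day_b))

# Jaccard index: shared titles over all distinct titles on either day
sim_jaccard <- shared / length(union(day_a, day_b))

sim_max      # 2 / 5
sim_jaccard  # 2 / 6
```

Both measures are symmetric, so swapping the two days gives the same value. Dividing by the previous day's count only, by contrast, depends on which day comes first.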

Thanks for sharing this. I ended up going with the other solution (see above) because I have a feeling it will work better if there is any variation in the number of titles. But I really appreciate your efforts all the same :) — Japes, May 01 '20 at 10:39

score 1 · Accepted Answer · answered Apr 30 '20 at 19:03

Here is another possibility. You can write a small function that calculates the similarity (or any other index) between two consecutive groups. Then split the titles by date into a list and lapply the function over the list positions, starting from the second date (the final result will be a list).

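# Titles grouped by date: one vector of titles per date
s <- split(df$title, df$date)

# Titles at position i that also appeared at position i-1, divided by
# the number of titles at position i-1 (reads `s` from the enclosing environment)
calc_similar <- function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}

# Skip the first date, which has no predecessor; name the results by date
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])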

Output

$`2020-01-02`
[1] 0.6
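
The name `2020-01-02` in the output above suggests the date column was converted to a Date before splitting. With the raw "dd/mm/yy" strings, `split()` orders the groups alphabetically, which is not chronological once the data spans more than one month. The following is a minimal sketch of that conversion, assuming day/month/two-digit-year strings. It also collects the result in a data frame (here called `result`, a name chosen for illustration) that could be written out for Tableau.

```r
# Question's example data, rebuilt so the block runs on its own
df <- data.frame(
  title = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
            "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
            "Dragons Den", "Peaky Blinders"),
  date = rep(c("01/01/20", "02/01/20"), each = 5)
)

# Assumption: dates are day/month/two-digit-year strings
df$date <- as.Date(df$date, format = "%d/%m/%y")

# Groups are now ordered by the sorted Date values, i.e. chronologically
s <- split(df$title, df$date)

# Titles on date i that also appeared on the previous date in the data,
# divided by the previous date's title count
sim <- vapply(seq_along(s)[-1], function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}, numeric(1))

# One row per date, starting from the second date
result <- data.frame(date = as.Date(names(s)[-1]), similarity = sim)
result
```

Note that this compares each date with the previous date present in the data, so a missing calendar day is silently skipped. The table can then be exported with `write.csv(result, "similarity.csv", row.names = FALSE)`, where the file name is only an example.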
