I have a dataframe that looks something like this:
df <- data.frame("index" = 1:10, "title" = c("Sherlock","Peaky Blinders","Eastenders","BBC News", "Antiques Roadshow","Eastenders","BBC News","Casualty", "Dragons Den","Peaky Blinders"), "date" = c("01/01/20","01/01/20","01/01/20","01/01/20","01/01/20","02/01/20","02/01/20","02/01/20","02/01/20","02/01/20"))
The output looks like this:
Index Title Date
1 Sherlock 01/01/20
2 Peaky Blinders 01/01/20
3 Eastenders 01/01/20
4 BBC News 01/01/20
5 Antiques Roadshow 01/01/20
6 Eastenders 02/01/20
7 BBC News 02/01/20
8 Casualty 02/01/20
9 Dragons Den 02/01/20
10 Peaky Blinders 02/01/20
I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).
It's probably also worth mentioning that the actual dataframe is much larger, and has 120 titles per day, and spans some 700 days. I need to compare the "titles" of each "date" with the previous "date" and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on...
Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and I'll have to somehow add it into the actual data itself.