I had the data like below:
CustomerId Category
100 2
100 2
100 3
100 6
100 4
200 3
200 6
200 7
300 2
So the output I want is the Jaccard Similarity Index:
- Jaccard(100,200) = 2 (items shared) / 5 (union of items)
- Jaccard(100,300) = 1 (items shared) / 4 (union of items)
- Jaccard(200,300) = 0 (items shared) / 4 (union of items).
What I tried initially is to find the Union and intersection of terms but I am not sure if it is the most efficient way. Also I want to avoid duplicates like Jaccard(100,300) and Jaccard(300,100) appearing together. Can someone help with this?
select t1.customer_id, t2.customer_id,
sum(case when t1.category_id = t2.category_id then 1 else 0 end) intersection,
sum(case when t1.category = t2.category then 1
when t1.category <> t2.category then 1 else 0 end)
union
from t t1 cross join
t t2
Where t1.customer_id <> t2.customer_id
group by t1.customer_id, t2.customer_id
Unfortunately I also checked that I have one customer buying multiple items in the same category. So I edited the table to reflect that Customer 100 has two items in Category 2. However, it should not change the Jaccard similarity metric values.