0

I had the data like below:

CustomerId   Category
  100            2
  100            2
  100            3
  100            6
  100            4
  200            3
  200            6
  200            7
  300            2

So the output I want is the Jaccard Similarity Index:

  • Jaccard(100,200) = 2 (items shared) / 5 (union of items)
  • Jaccard(100,300) = 1 (items shared) / 4 (union of items)
  • Jaccard(200,300) = 0 (items shared) / 4 (union of items).

What I tried initially is to find the Union and intersection of terms but I am not sure if it is the most efficient way. Also I want to avoid duplicates like Jaccard(100,300) and Jaccard(300,100) appearing together. Can someone help with this?

     select t1.customer_id, t2.customer_id,
              sum(case when t1.category_id = t2.category_id then 1 else 0 end) intersection,
sum(case when t1.category = t2.category then 1
         when t1.category <> t2.category then 1 else 0 end)
union
    from t t1 cross join
         t t2
  Where t1.customer_id <> t2.customer_id
    group by t1.customer_id, t2.customer_id

Unfortunately I also checked that I have one customer buying multiple items in the same category. So I edited the table to reflect that Customer 100 has two items in Category 2. However, it should not change the Jaccard similarity metric values.

vkaul11
  • 4,098
  • 12
  • 47
  • 79

1 Answers1

1

You don't need a cross join. Get the denominator by computing the total of distinct category_id's for a pair and subtracting the intersecting category_id's from it.

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
     / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
JOIN t t2 ON t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

If inequalities aren't supported in join, use

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
       / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

If you just need the intersection counts across pairs, the query below should suffice.

select t1.customer_id as id1, t2.customer_id as id2
,sum(case when t1.category_id = t2.category_id then 1 else 0 end) as intersection
from t t1 
join t t2 on t1.customer_id<t2.customer_id
group by t1.customer_id, t2.customer_id

Edit: Based on OP's comment that a customer can have the same category multiple times but should be counted only once.

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)
       / (COUNT(DISTINCT t1.category_id)+COUNT(DISTINCT t2.category_id)
         -COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id 
Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58