Jaccard similarity computation in hive

Question

I had the data like below:

CustomerId   Category
  100            2
  100            2
  100            3
  100            6
  100            4
  200            3
  200            6
  200            7
  300            2

So the output I want is the Jaccard Similarity Index:

Jaccard(100,200) = 2 (items shared) / 5 (union of items)
Jaccard(100,300) = 1 (items shared) / 4 (union of items)
Jaccard(200,300) = 0 (items shared) / 4 (union of items).

What I tried initially is to find the Union and intersection of terms but I am not sure if it is the most efficient way. Also I want to avoid duplicates like Jaccard(100,300) and Jaccard(300,100) appearing together. Can someone help with this?

     select t1.customer_id, t2.customer_id,
              sum(case when t1.category_id = t2.category_id then 1 else 0 end) intersection,
sum(case when t1.category = t2.category then 1
         when t1.category <> t2.category then 1 else 0 end)
union
    from t t1 cross join
         t t2
  Where t1.customer_id <> t2.customer_id
    group by t1.customer_id, t2.customer_id

Unfortunately I also checked that I have one customer buying multiple items in the same category. So I edited the table to reflect that Customer 100 has two items in Category 2. However, it should not change the Jaccard similarity metric values.

so the similarity for (100,200) should be 2/5..not 2/4..correct? — Vamsi Prabhala, May 22 '17 at 17:48
Use `t1.customer_id < t2.customer_id`; that avoids `(100,100)` and `(300,100)`. — Jonathan Leffler, May 22 '17 at 18:16
I don't know whether [How to select similar sets in SQL](https://stackoverflow.com/questions/13132270/how-to-select-similar-sets-in-sql/) adds anything for you — probably not. — Jonathan Leffler, May 22 '17 at 18:24
In Hive the above does not work as non-equality joins don't work — vkaul11, May 23 '17 at 23:34

Vamsi Prabhala · Accepted Answer · 2017-05-24T13:16:33.963

You don't need a cross join. Get the denominator by computing the total of distinct category_id's for a pair and subtracting the intersecting category_id's from it.

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
     / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
JOIN t t2 ON t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

If inequalities aren't supported in join, use

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
       / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

If you just need the intersection counts across pairs, the query below should suffice.

select t1.customer_id as id1, t2.customer_id as id2
,sum(case when t1.category_id = t2.category_id then 1 else 0 end) as intersection
from t t1 
join t t2 on t1.customer_id<t2.customer_id
group by t1.customer_id, t2.customer_id

Edit: Based on OP's comment that a customer can have the same category multiple times but should be counted only once.

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)
       / (COUNT(DISTINCT t1.category_id)+COUNT(DISTINCT t2.category_id)
         -COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

what doesn't work? inequality in join? i edited the answer with a `cross join`..it should work — Vamsi Prabhala, May 23 '17 at 23:34
the problem is Jaccard similarity comes out to > 1 actually in my case because I have same customer purchasing multiple items. I guess the intersection is not computed accurately. — vkaul11, May 24 '17 at 07:05
@vkaul11 .. i edited the answer to count only distinct items per pair. this should give the expected results. — Vamsi Prabhala, May 24 '17 at 11:53

Jaccard similarity computation in hive

1 Answers1