SQL - Computing overlap between Interests

Question

I have a schema (millions of records with proper indexes in place) that looks like this:

groups    |  interests
------    |  ---------
user_id   |  user_id
group_id  |  interest_id

A user can like 0..many interests and belong to 0..many groups.

Problem: Given a group ID, I want to get all the interests for all the users that do not belong to that group, and, that share at least one interest with anyone that belongs to the same provided group.

Since the above might be confusing, here's a straightforward example (SQLFiddle):

| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |

In the above example users are labeled with numbers while interests have characters.

If we assume that users 1 and 2 belong to group -1, then users 3 and 5 would be interesting:

user_id  interest_id
-------  -----------
      3            A
      3            B
      3            D
      5            B

I already wrote a dumb and very inefficient query that correctly returns the above:

SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);

But all my attempts to translate that into a proper joined query revealed themselves fruitless: either the query returns way more rows than it should or it just takes 10x as long as the sub-query, like:

SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);

I've been struggling trying to optimize this query for the past two nights...

Any help or insight that gets me in the right direction would be greatly appreciated. :)

PS: Ideally, one query that returns an aggregated count of common interests would be even nicer:

user_id  totalInterests  commonInterests
-------  --------------  ---------------
      3               3              1/2 (either is fine, but 2 is better)
      5               1                1

However, I'm not sure how much slower it would be compared to doing it in code.

@CL. Just exploratory benchmarks, nothing else. I'm aware that with JOINs vs. sub-queries there's not always a clear winner. But, as it turns out PhilipKelley answer is over 100x faster than my original approach (I guess mostly because of the use of CTEs). — Alix Axel, Nov 01 '15 at 23:33
Non-recursive CTEs have *no* effect on the query optimizer; they are just inserted into the `FROM cte` as a subquery. They are useful only for documentation, or for saving typing when using them multiple times. — CL., Nov 02 '15 at 08:35

score 3 · Accepted Answer · answered Oct 28 '15 at 23:19

Using the following to set up test tables

--drop table Interests  ----------------------------
CREATE TABLE Interests
 (
   InterestId  char(1)  not null
  ,UserId      int      not null
 )

INSERT Interests values
  ('A',1)
 ,('A',3)
 ,('B',1)
 ,('B',2)
 ,('B',3)
 ,('B',5)
 ,('C',2)
 ,('D',3)
 ,('D',4)


--  drop table Groups  ---------------------
CREATE TABLE Groups
 (
   GroupId  int  not null
  ,UserId   int  not null
 )

INSERT Groups values
  (-1, 1)
 ,(-1, 2)


SELECT * from Groups
SELECT * from Groups

The following query would appear to do what you want:

DECLARE @GroupId int

SET @GroupId = -1

;WITH cteGroupInterests (InterestId)
 as (--  List of the interests referenced by the target group
     select distinct InterestId
      from Groups gr
       inner join Interests nt
        on nt.UserId = gr.UserId
      where gr.GroupId = @GroupId)
--  Aggregate interests for each user
SELECT
   UserId
  ,count(OwnInterstId)      OwnInterests
  ,count(SharedInterestId)  SharedInterests
 from (--  Subquery lists all interests for each user
       select
          nt.UserId
         ,nt.InterestId   OwnInterstId
         ,cte.InterestId  SharedInterestId
        from Interests nt
         left outer join cteGroupInterests cte
          on cte.InterestId = nt.InterestId
        where not exists (--  Correlated subquery: is "this" user in the target group?)
                          select 1
                           from Groups gr
                           where gr.GroupId = @GroupId
                            and gr.UserId = nt.UserId)) xx
 group by UserId
 having count(SharedInterestId) > 0

It appears to work, but I'd want to do more elaborate tests, and I've no idea how well it'd work against millions of rows. Key points are:

cte creates a temp table referenced by the later query; building an actual temp table might be a performance boost
Correlated subqueries can be tricky, but indexes and not exists should make this pretty quick
I was lazy and left out all the underscores, sorry

Sorry it took me so long to find the time to look into this. The CTEs really do help a lot (I tried using them before but never understood exactly how), I ended up creating another CTE to hold the members of the group and the speedup was even more noticeable! Incidentally it appears that doing `WHERE nt.UserId NOT IN cteGroup` is even faster than the `NOT EXISTS` approach. Thanks a lot! — Alix Axel, Nov 01 '15 at 23:29
@AlixAxel: Be *extremely* careful with `NOT IN` and possible nulls. If `NOT IN` is substantially faster, it's likely you get *wrong* results. For details, see http://stackoverflow.com/a/11074428/484293 — blubb, Nov 02 '15 at 18:03

score 1 · Answer 2 · answered Oct 28 '15 at 23:53

This is a bit confounding. I think the best approach is exists and not exists:

select i.*
from interest i
where not exists (select 1
                  from groups g
                  where i.user_id = g.user_id and
                        g.group_id = $group_id
                 ) and
      exists (select 1
              from groups g join
                   interest i2
                   on g.user_id = i2.user_id
              where g.user_id <> i.user_user_id and
                    i.interest_id = i2.interest_id
             );

The first subquery is saying that the user is not in the group. The second is saying that the interest is shared with someone who is in the group.

SQL - Computing overlap between Interests

2 Answers2