
Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has the exactly specified number of conversationUsers. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.

So far I've tried this:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;

Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).
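(A self-contained way to reproduce this, in case anyone wants to experiment: the sketch below uses SQLite via Python instead of Postgres, with invented sample rows, but shows the same behavior.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversationUsers (userId INTEGER, conversationId INTEGER);
INSERT INTO conversationUsers VALUES
  (1, 10), (4, 10),           -- exactly {1,4}: should match
  (1, 20), (4, 20), (5, 20);  -- {1,4,5}: should NOT match, but does
""")

rows = conn.execute("""
    SELECT c.conversationId
    FROM   conversationUsers c
    WHERE  c.userId IN (1, 4)
    GROUP  BY c.conversationId
    HAVING COUNT(c.userId) = 2
""").fetchall()

# Conversation 20 is wrongly included: the WHERE clause filters out user 5
# before counting, so the superset conversation still counts 2 matching rows.
print(sorted(rows))  # [(10,), (20,)]
```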

bento
    It would be helpful to provide your version of Postgres, minimum table definition and some sample rows to test with. – Erwin Brandstetter Nov 10 '18 at 03:23

3 Answers


This is a case of relational division - with the added special requirement that the same conversation shall have no additional users.

Assuming the PK of table "conversationUsers" is on ("userId", "conversationId"): it enforces uniqueness of the combination and NOT NULL, and also implicitly provides the essential index for performance (columns of the multicolumn PK in this order). Ideally, you have another index on ("conversationId", "userId").

For the basic query, there is the "brute force" approach: count the number of matching users for all conversations of all given users, then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but it doesn't scale well:

SELECT "conversationId"
FROM   "conversationUsers" c
WHERE  "userId" = ANY ('{1,4,6}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = c."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );
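A quick way to sanity-check this logic is a throwaway session (a sketch using SQLite via Python rather than Postgres, so the array parameter is expanded into an IN list; table layout and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversationUsers (userId INTEGER, conversationId INTEGER,
                                PRIMARY KEY (userId, conversationId));
INSERT INTO conversationUsers VALUES
  (1, 10), (4, 10), (6, 10),           -- exactly {1,4,6}: should match
  (1, 20), (4, 20), (6, 20), (5, 20),  -- superset: must be excluded
  (1, 30), (4, 30);                    -- subset: must be excluded
""")

user_ids = [1, 4, 6]
placeholders = ",".join("?" * len(user_ids))
rows = conn.execute(f"""
    SELECT c.conversationId
    FROM   conversationUsers c
    WHERE  c.userId IN ({placeholders})
    GROUP  BY c.conversationId
    HAVING count(*) = ?
    AND    NOT EXISTS (               -- anti-semi-join: no extra users
       SELECT 1 FROM conversationUsers x
       WHERE  x.conversationId = c.conversationId
       AND    x.userId NOT IN ({placeholders})
       )
""", (*user_ids, len(user_ids), *user_ids)).fetchall()

print(rows)  # only conversation 10 has exactly {1,4,6}
```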

A NOT EXISTS anti-semi-join eliminates conversations with additional users.

Alternative techniques:

There are various other, (much) faster query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.

For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:

WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );
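The step-by-step elimination can be sketched portably as well (SQLite via Python; since SQLite has no array type, the input array is replaced here by a small (idx, userId) helper table, which is an assumption of this sketch, not part of the original query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversationUsers (userId INTEGER, conversationId INTEGER,
                                PRIMARY KEY (userId, conversationId));
INSERT INTO conversationUsers VALUES
  (1, 10), (4, 10), (6, 10),
  (1, 20), (4, 20), (6, 20), (5, 20),
  (1, 30), (4, 30);

CREATE TABLE input_ids (idx INTEGER PRIMARY KEY, userId INTEGER);
INSERT INTO input_ids VALUES (1, 1), (2, 4), (3, 6);
""")

rows = conn.execute("""
WITH RECURSIVE rcte AS (
   -- start with conversations containing the first user ID
   SELECT conversationId, 1 AS idx
   FROM   conversationUsers
   WHERE  userId = (SELECT userId FROM input_ids WHERE idx = 1)

   UNION ALL
   -- each step keeps only conversations also containing the next user ID
   SELECT c.conversationId, r.idx + 1
   FROM   rcte r
   JOIN   conversationUsers c ON c.conversationId = r.conversationId
   WHERE  c.userId = (SELECT i.userId FROM input_ids i WHERE i.idx = r.idx + 1)
   )
SELECT conversationId
FROM   rcte r
WHERE  idx = (SELECT max(idx) FROM input_ids)  -- survived all steps
AND    NOT EXISTS (                            -- and has no extra users
   SELECT 1 FROM conversationUsers x
   WHERE  x.conversationId = r.conversationId
   AND    x.userId NOT IN (SELECT userId FROM input_ids)
   )
""").fetchall()

print(rows)  # only the exact match survives
```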

For ease of use wrap this in a function or prepared statement. Like:

PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );

Call:

EXECUTE conversations('{1,4,6}');

db<>fiddle here (also demonstrating a function)

There is still room for improvement: for top performance, put users with the fewest conversations first in your input array, so as many rows as possible are eliminated early. For the best performance you can dynamically generate a non-dynamic, non-recursive query (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
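Generating such a non-recursive query could look like the following (a Python sketch that builds one self-join per user ID; the helper name and exact SQL shape are illustrative assumptions, not the exact technique referenced above):

```python
def exact_users_query(n: int) -> str:
    """Build a Postgres query (as text, with parameters $1..$n) that finds
    conversations whose user set is exactly the n given user IDs:
    one self-join per additional user, plus an anti-join against extras.
    Hypothetical helper, for illustration only."""
    joins = "\n".join(
        f'JOIN   "conversationUsers" c{i} USING ("conversationId")'
        for i in range(2, n + 1)
    )
    conds = "\nAND    ".join(f'c{i}."userId" = ${i}' for i in range(1, n + 1))
    params = ", ".join(f"${i}" for i in range(1, n + 1))
    return f'''SELECT c1."conversationId"
FROM   "conversationUsers" c1
{joins}
WHERE  {conds}
AND    NOT EXISTS (
   SELECT FROM "conversationUsers" x
   WHERE  x."conversationId" = c1."conversationId"
   AND    x."userId" <> ALL (ARRAY[{params}])
   );'''

print(exact_users_query(3))
```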


Alternative: MV for sparsely written table

If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.

CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");

The demonstrated covering index requires Postgres 11.

In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.

Then the much faster query would simply be:

SELECT "conversationId"
FROM   mv_conversation_users c
WHERE  users = '{1,4,6}'::int[];  -- sorted array!

db<>fiddle here

You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
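The principle behind the MV (a canonically sorted user set per conversation, used as a plain equality key) can be illustrated outside Postgres too; a Python sketch with invented sample rows:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversationUsers (userId INTEGER, conversationId INTEGER,
                                PRIMARY KEY (userId, conversationId));
INSERT INTO conversationUsers VALUES
  (1, 10), (4, 10), (6, 10),
  (1, 20), (4, 20), (6, 20), (5, 20),
  (1, 30), (4, 30);
""")

# "Materialize" one sorted user tuple per conversation ...
by_conversation = defaultdict(list)
for user_id, conversation_id in conn.execute(
        "SELECT userId, conversationId FROM conversationUsers ORDER BY userId"):
    by_conversation[conversation_id].append(user_id)

# ... and key it for a plain equality lookup.
lookup = defaultdict(list)
for conversation_id, users in by_conversation.items():
    lookup[tuple(users)].append(conversation_id)

print(lookup[tuple(sorted([1, 4, 6]))])  # [10]
```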

Aside: consider legal, lower-case identifiers that don't require double quotes: conversation_id instead of "conversationId", etc.

Erwin Brandstetter
  • @HummingBird24: My apologies, I butchered the assumption about the PK in my previous edit, so it wasn't obvious any more why it works like that. Consider the edit fixing that. Given the PK, the filter in the `HAVING` clause eliminates conversations with only a subset of the given users, e.g. `'{1,6}'` when looking for `'{1,4,6}'`. `having count(*) = n` acts like `having count(*) >= n` (not like `having count(*) > n`!) because after `WHERE "userId" = ANY ('{1,4,6}'::int[])` there can never be more than `array_length('{1,4,6}'::int[], 1)` matches. – Erwin Brandstetter Sep 18 '22 at 20:00
  • Ahh okay I understand now. Thanks for the clarification! And yes `>=` not `>` that was a typo on my part. – HummingBird24 Sep 21 '22 at 06:48

You can modify your query like this, and it should work:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    GROUP BY c1."conversationId"
    HAVING COUNT(DISTINCT c1."userId") = 2  -- both given users present
    )
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;      -- and no further users
ujawg

This might be easier to follow. You want the conversation ID, so group by it, then add a HAVING clause requiring that the count of matching user IDs equals the count of all distinct users within the group (and equals the number of IDs you are looking for). This will work, but takes longer to process because there is no pre-qualifier.

select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationID
   having 
      sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2

To simplify the list even more, have a pre-query of conversations that at least one of the users is in. If that user is not in a conversation to begin with, why bother considering it at all?

select
      cu.ConversationId
   from
      ( select cu2.ConversationID
           from conversationUsers cu2
           where cu2.userID = 4 ) preQual
      JOIN conversationUsers cu
         ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationID
   having 
      sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2
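A quick sanity check of the conditional-sum idea against subset and superset cases (SQLite via Python; sample rows invented; the extra count condition in this sketch pins the group to exactly two users):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversationUsers (userId INTEGER, conversationId INTEGER,
                                PRIMARY KEY (userId, conversationId));
INSERT INTO conversationUsers VALUES
  (1, 10), (4, 10),           -- exactly {1,4}
  (1, 20), (4, 20), (5, 20),  -- superset
  (1, 30);                    -- subset
""")

rows = conn.execute("""
    SELECT conversationId
    FROM   conversationUsers cu
    GROUP  BY conversationId
    HAVING sum(CASE WHEN userId IN (1, 4) THEN 1 ELSE 0 END)
             = count(DISTINCT userId)        -- no users outside {1,4}
    AND    count(DISTINCT userId) = 2        -- and both of them present
""").fetchall()

print(rows)  # only conversation 10 qualifies
```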
DRapp