Given a set of IDs, Return the Subset of Orders with only those IDs

Question

Given a set of product_ids, what are the order_ids that only have those product_ids?

For the example below, I'm only wanting order_ids that have some combination of (a,b,c). I have 2 tables like below:

"transactions" table:

order_id | product_id |
---------+-------------
    1    |    a       |
    1    |    b       |
    2    |    a       |
    2    |    X       |
    3    |    a       |
    3    |    b       |
    3    |    c       |
    ...  |    ...     |
    999  |    Y       |

"products" table:

product_id |
------------
     a     |
     b     |
     c     |
     d     |
     X     |
     Y     |
     ...   |
     ZZZ   |

Desired Output has 2 order_ids with expected table output:

order_id |
----------
    1    |
    3    |

Notice that order_id == 2 is removed although it has product_id == a but because it has product_id == X then it should be removed.

Therefore it's not a simple:

SELECT DISTINCT(order_id)
FROM transactions
WHERE product_id IN (a, b, c)

Unrelated to your question, but: `distinct` is **not** a function. It always applies to all columns in the select list. Enclosing one of the columns with parentheses won't change anything and is useless. `distinct (a),b` is the same as `distinct a,(b)` or `distinct a,b` — , Jan 26 '21 at 08:16

Erwin Brandstetter · Accepted Answer · 2021-01-26T05:02:40.603

Typically, there is an orders table to go with that, with exactly one row per order.

If we can further assume that there is always at least one transaction for every order, this would do the job:

SELECT o.id
FROM   orders o
WHERE  NOT EXISTS (
   SELECT FROM transactions  -- SELECT list can be empty for EXISTS test
   WHERE  order_id = o.id
   AND    product_id <> ALL ('{a,b,c}')
   );

That's good for very common product_id's or long lists.

For short lists or rare products, it will be faster to start with a positive selection first. Like:

SELECT order_id
FROM  (
   SELECT DISTINCT order_id
   FROM   transactions
   WHERE  product_id = ANY ('{a,b,c}')
   ) t
WHERE  NOT EXISTS (
   SELECT FROM transactions
   WHERE  order_id = t.order_id
   AND    product_id <> ALL ('{a,b,c}')
   );

An index on (product_id) is essential for performance. Better yet, a multicolumn index on (product_id, order_id), plus another one on (order_id, product_id). See:

Is a composite index also good for queries on the first field?

The manual about array literals:

https://www.postgresql.org/docs/current/arrays.html#ARRAYS-INPUT

About the ANY and ALL constructs:

`orders` is down for a week :/ How would you solve without a 3rd table? — mitchell-reynolds, Jan 26 '21 at 04:38
First time I'm seeing usage of `('{a,b,c}')`, does the syntax allow for strings with spaces or odd characters? Say something like `('{a,b,c, a-b-c}')` where "a-b-c" is a single ID — mitchell-reynolds, Jan 26 '21 at 04:45
@mitchell-reynolds: That's an array *literal*. Strings with spaces must be enclosed in double-quotes. I added some links above. — Erwin Brandstetter, Jan 26 '21 at 04:51

Charlieface · Answer 2 · 2021-01-26T04:36:33.597

1

We need to define what the opposite of your requirement is and filter against that. So, which orders have at least one transaction not in a,b,c. We count the number of such transactions in the order grouping, and filter out orders with COUNT > 0, only returning the ones with COUNT = 0.

SELECT order_id
FROM transactions
GROUP BY order_id
HAVING COUNT(CASE WHEN product_id NOT IN (a, b, c) THEN 1 END) = 0

If you had a,b,c as a list of products in another table and you wanted to filter against that rather than hard-coding into the query, then it is slightly more complicated:

SELECT order_id
FROM transactions AS t
LEFT JOIN listOfProducts AS l ON l.product_id = t.product_id
GROUP BY order_id
HAVING COUNT(CASE WHEN l.product_id IS NULL THEN 1 END) = 0

edited Jan 26 '21 at 04:36

answered Jan 26 '21 at 04:19

Charlieface

52,284
6
19
43

What's `myList` in the second part? `myList = (a, b, c)`? Or is it the NOT `(a, b, c)` list? – mitchell-reynolds Jan 26 '21 at 04:33
1

`a,b,c` from a table rather than directly in the query. Have clarified – Charlieface Jan 26 '21 at 04:37

Given a set of IDs, Return the Subset of Orders with only those IDs

2 Answers2