PostgreSQL. Select a column that correlates with value in the aggregate function

Question

Here is the 'items' table, containing more than 10 rows:

+-----+-----------+-----------+----------+
| id  | item_name | category  | quantity |
+=====+===========+===========+==========+
| 3   | item33    | category1 | 5        |
+-----+-----------+-----------+----------+
| 2   | item52    | category5 | 1        |
+-----+-----------+-----------+----------+
| 1   | item46    | category1 | 3        |
+-----+-----------+-----------+----------+
| 4   | item11    | category3 | 2        |
+-----+-----------+-----------+----------+
| ... | ...       | ...       | ...      |
+-----+-----------+-----------+----------+

Values in the 'items' column are unique, the ones in the 'category' columnt - aren't unique.

The task is:

Remove duplicates of categories: if a category contains more than 1 item, take the row with minimal 'id'.
Order results by the 'quantity' (ASC).
Take 10 rows: top 5 and random 5 from the rest result data output.

So, the ordering table (after #2 sub-task) should look like that:

+-----+-----------+-----------+----------+
| id  | item_name | category  | quantity |
+=====+===========+===========+==========+
| 2   | item52    | category5 | 1        |
+-----+-----------+-----------+----------+
| 4   | item11    | category3 | 2        |
+-----+-----------+-----------+----------+
| 1   | item46    | category1 | 3        |
+-----+-----------+-----------+----------+
| ... | ...       | ...       | ...      |
+-----+-----------+-----------+----------+

I know how to exclude duplicates for categories:

SELECT min(id) as id, category
FROM items
GROUP BY category

But I don't know how to order it by the quantity. If I try to add 'quantity' to the 'select' line and then make 'ORDER BY quantity', I get the error: "column "quantity" must appear in the GROUP BY clause or be used in an aggregate function".

If there is a way to add this 'quantity' column to the data output (the value in this column should correlate with the resulting 'id' value (i.e. "min(id)"))? And then do ordering and picking rows...

score 2 · Accepted Answer · answered Dec 25 '20 at 05:17

Consider joining back your aggregate query to the unit level data for all columns including quantity:

SELECT i.id, i.item_name, i.category, i.quantity
FROM items i
INNER JOIN 
  (SELECT category, min(id) AS min_id
   FROM items
   GROUP BY category) agg
 ON i.id = agg.min_id
 AND i.category = agg.category
ORDER BY i.quantity

For top 5 and random 5 split, integrate a union with CTE to hold the resultset:

WITH sub AS (
  SELECT i.id, i.item_name, i.category, i.quantity
  FROM items i
  INNER JOIN 
    (SELECT category, min(id) AS min_id
     FROM items
     GROUP BY category) agg
   ON i.id = agg.min_id
   AND i.category = agg.category
)

-- TOP 5 ROWS
SELECT id, item_name, category, quantity
FROM sub
ORDER BY i.quantity
LIMIT 5

UNION

-- RANDOM ROWS OF NON-TOP 5
SELECT id, item_name, category, quantity
FROM 
  (SELECT id, item_name, category, quantity
   FROM sub
   ORDER BY i.quantity
   OFFSET 5) below5
ORDER BY random()
LIMIT 5

This solution works best for me, 'cause with my little knowledge of Postgres I can at least understand this code :D Thanks for the help everyone, regardless. I really appreciate it. — Ivan, Dec 26 '20 at 01:40
Great to hear and glad to help! This solution also works in other RDBMS's and not limited to Postgres dialect methods. Happy SQLing! — Parfait, Dec 26 '20 at 04:45

Popeye · Answer 2 · 2020-12-25T05:08:46.450

1

You need to use analytical function as follows:

Select * from
(Select t.*,
       Row_number() over (order by quantity) as rn_q
 from
(Select t.*,
       Row_number() over (partition by category order by id) as rn
  From your_table) t
Where rn = 1) t
Order by case when rn_q <= 5 then quantity else 6 end;

edited Dec 25 '20 at 05:08

answered Dec 25 '20 at 03:21

Popeye

35,427
4
10
31

Erwin Brandstetter · Answer 3 · 2020-12-25T08:35:12.043

Basically, DISTINCT ON serves nicely in Postgres. See:

Simple (correct!) solution:

WITH dist_cat AS (
   SELECT t, row_number() OVER (ORDER BY quantity, id) AS rn   -- added id as tiebreaker
   FROM  (
      SELECT DISTINCT ON (category) *
      FROM   tbl
      ORDER  BY category, id
      ) t  -- distinct categories
   ORDER  BY ORDER BY quantity, id  -- match sort for row_number()
   )
SELECT (t).*
FROM   dist_cat
WHERE  rn <= 5

UNION ALL   -- not just UNION
(  -- parentheses required
SELECT (t).*
FROM   dist_cat
WHERE  rn > 5
ORDER  BY random()
LIMIT  5
);

Added id as tiebreaker to the sort, as sorting by quantity is hardly deterministic. Put any unique expression there that suits your requirements. Or skip it if you are OK with arbitrary results that may change with every call.

The row type t is for convenience, so we don't have to spell out all column names, and still get rid of the appended rn in the result, which has not been requested.

I chose to order rows in the CTE and attach a row nubmer rn to avoid an additional sort operation.

The additional 5 random rows are truly picked randomly, not just arbitrarily.

Use UNION ALL, not just UNION. Because it's correct for what we are doing, and cheaper, too. But also to preserve the the sort order from the CTE; UNION can mess it up trying to remove duplicates - in vain.

For big tables, depending on data distribution, there may be (much) faster techniques ...

... for getting unique categories:

Optimize GROUP BY query to retrieve latest row per user

.. for getting random rows:

Best way to select random rows PostgreSQL

PostgreSQL. Select a column that correlates with value in the aggregate function

3 Answers3