5

I'm try to get the mode() for a grouped data set, but without grouping the results. (Using Postgres 9.5, can upgrade if needed.)

e.g. Users have a 'favorite color', and belong to a single group. Get list of users with the mode() 'favorite color' within their group.

A window function would work for most aggregates, but mode() seems to be an exception that isn't compatible with window functions. Is there another way to go about this? Here's what I've been toying with so far ...

Works but gives grouped results, I'm looking for the results to be ungrouped:

SELECT group_id, 
    mode() WITHIN GROUP (ORDER BY color)
FROM users
GROUP BY group_id;

Invalid syntax (just an example of what I'm trying to accomplish):

SELECT id, color, group_id, 
    mode(color) OVER (PARTITION BY group_id)
FROM users;

Or:

SELECT id, color, group_id, 
    mode() WITHIN GROUP (ORDER BY color) OVER (PARTITION BY group_id)
FROM users;

I tried using a lateral join, but couldn't get it to work right without re-iterating my WHERE clause both inside and outside the join (which I'd prefer not to do for when this query gets more complicated):

SELECT u1.id, u1.group_id, u1.color, mode_color
FROM users u1
LEFT JOIN LATERAL
    (SELECT group_id, mode() WITHIN GROUP (ORDER BY color) as mode_color
     FROM users
     WHERE group_id = d1.group_id
     GROUP BY group_id)
    u2 ON u1.group_id = u2.group_id
WHERE u1.type = 'customer';

It's important that WHERE u1.type = 'customer' stays outside of the subquery, as that's being appended to the query at a later point, after the first half of it is already written.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
PeanutsMcgee
  • 55
  • 1
  • 8
  • You filter by `type`, but aggregate by `group_id`. This complicates matters and it would help to know how the two are related. Is one a subgroup of the other? It also raises the question: which mode *exactly*? The one after filtering, or the one before (for *all* rows in the same group)? And *always* disclose your version of Postgres, please. – Erwin Brandstetter Apr 05 '19 at 23:37

1 Answers1

4

We are talking about the ordered-set aggregate function mode(), introduced with Postgres 9.4. You probably saw this error message:

ERROR:  OVER is not supported for ordered-set aggregate mode

We can work around it. But which mode exactly?

(All assuming group_id and type are NOT NULL, else you need to do more.)

Mode of qualifying rows

This computes the mode based on the filtered set (with type = 'customer') alone.
You get the most popular color per group among "customers".

A subquery in a plain JOIN (without LEFT and LATERAL in this case) would do the job - calculating the mode once per group, not for every individual row:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
JOIN  (                            -- not LATERAL
   SELECT group_id, type           -- propagate out for the join
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users 
   WHERE  type = 'customer'        -- place condition in subquery (cheap)
   GROUP  BY group_id, type
   ) u2 USING (group_id, type);    -- shorthand syntax for matching names
-- WHERE  type = 'customer'        -- or filter later (expensive)

To avoid repeating your condition, place it in the subquery and propagate it to the outer query in the join clause - I picked matching column names and joined with USING in my example.

You can move the condition to the outer query or even to a later step, yet. It will be needlessly more expensive, though, as the mode for every combination of (group_id, type) has to be calculated, before the results for every other type is excluded in a later step.

There are ways to parameterize your query. Prepared statements, PL/pgSQL function, see:

Or, if the underlying table does not change much, a materialized view with all pre-computed modes per (group_id, type) replacing the subquery would be an option.

One more option: use a CTE to filter qualifying rows first, then the WHERE condition can stay outside of the subquery like you requested:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'        -- predicate goes here
   )
SELECT *
FROM   cte u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   cte                      -- based on only qualifying rows
   GROUP  BY 1
   ) u2 USING (group_id);

We can simplify with SELECT * since USING conveniently places only one group_id in the result set.

Mode of all rows

If you want to base the mode on all rows (including those where type = 'customer' is not true), you need a different query.
You get the most popular color per group among all members.

Move the WHERE clause to the outer query:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users
   GROUP  BY group_id
   ) u2 USING (group_id)
WHERE  u1.type = 'customer';

If your predicate (type = 'customer') is selective enough, computing the mode for all groups may be a waste. Filter the small subset first and only compute the mode for contained groups. Add a CTE for this:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'
   )
SELECT *
FROM   cte u1
LEFT   JOIN (        -- or JOIN
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM  (SELECT DISTINCT group_id FROM cte) g  -- only relevant groups
   JOIN   users USING (group_id)                -- but consider all rows for those
   GROUP  BY 1
   ) u2 USING (group_id);

Similar to the CTE query above, but based on all group members in the base table.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks for the great reply, the "Mode of qualifying rows" is what I'm looking for, but I really need to keep the conditions outside of the subquery. That's why I was looking into lateral joins (but maybe I'm misunderstand what those are used for) In my use case, this is a small piece of a much larger query, with many conditions being appended to the query later, after this piece is already written. – PeanutsMcgee Apr 06 '19 at 15:58
  • @PeanutsMcgee: Updated to address that, see above. – Erwin Brandstetter Apr 06 '19 at 19:11
  • Sorry I should have been more clear, but I don't think this is a solution for me, I might just be asking for something impossible. The conditions are dynamic, it's not always using 'type', it may be 'created_at' or 'last_name' or a dozen other filterable choices in any combination. That's why I stressed that the conditions stay out of the subquery. Moving the subquery condition from a WHERE to a GROUP BY doesn't help. Thank you a ton for you effort though, I learned a lot. – PeanutsMcgee Apr 06 '19 at 22:03
  • @PeanutsMcgee: I added one more option with a CTE. Won't get better than this. Computing per row with a lateral join would multiply the work to be done by Postgres. – Erwin Brandstetter Apr 10 '19 at 00:56