23

I'm new to Postgres, coming from MySQL and hoping that one of y'all would be able to help me out.

I have a table with three columns: name, week, and value. This table has a record of the names, the week at which they recorded the height, and the value of their height. Something like this:

Name  |  Week  | Value
------+--------+-------
John  |  1     | 9
Cassie|  2     | 5
Luke  |  6     | 3
John  |  8     | 14
Cassie|  5     | 7
Luke  |  9     | 5
John  |  2     | 10
Cassie|  4     | 4
Luke  |  7     | 4

What I want is a list per user of the value at the minimum week and the max week. Something like this:

Name  |minWeek | Value |maxWeek | value
------+--------+-------+--------+-------
John  |  1     | 9     | 8      | 14
Cassie|  2     | 5     | 5      | 7
Luke  |  6     | 3     | 9      | 5

In Postgres, I use this query:

select name, week, value
from table t
inner join(
select name, min(week) as minweek
from table
group by name)
ss on t.name = ss.name and t.week = ss.minweek
group by t.name
;

However, I receive an error:

column "w.week" must appear in the GROUP BY clause or be used in an aggregate function
Position: 20

This worked fine for me in MySQL so I'm wondering what I'm doing wrong here?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
user3915795
  • 233
  • 1
  • 2
  • 6
  • 1
    So how do you expect to `GROUP BY` the second column if its result is only known **AFTER** grouping? – zerkms Aug 06 '14 at 21:09
  • 3
    This statement logically doesn't make any sense. Going from MySql to Postgres, you'll have to get used to the fact that you can no longer do things that make no sense. – Mike Christensen Aug 06 '14 at 21:10
  • 3
    "This worked fine for me in MySQL so I'm wondering what I'm doing wrong here?" - MySQL doesn't handle grouping well and will do the wrong thing without returning an error while Postgres is smart enough to return an error. It's not working fine on MySQL, it's doing the wrong thing without an error – Twelfth Aug 06 '14 at 21:12
  • 3
    MySql literally just makes up random stuff to do if you give it these sorts of queries. – Mike Christensen Aug 06 '14 at 21:13
  • Sorry edited the query that I've been using in mySQL - that's the query that works in mySQL – user3915795 Aug 06 '14 at 21:13
  • And what is the error message for the new query? – zerkms Aug 06 '14 at 21:14
  • column "t.week" must appear in the GROUP BY clause or be used in an aggregate function Position: 20 – user3915795 Aug 06 '14 at 21:16
  • 2
    In MySQL I get `ERROR 1052 (23000): Column 'name' in field list is ambiguous` with your query. – jjanes Aug 06 '14 at 21:20
  • 1
    If you're grouping by `t.name`, there could be multiple rows in that group. Thus, trying to reference `t.week` makes no sense. Which `t.week` are you talking about? There could be one, there could be 50. – Mike Christensen Aug 06 '14 at 21:21

2 Answers2

47

There are various simpler and faster ways.

2x DISTINCT ON

SELECT *
FROM  (
   SELECT DISTINCT ON (name)
          name, week AS first_week, value AS first_val
   FROM   tbl
   ORDER  BY name, week
   ) f
JOIN (
   SELECT DISTINCT ON (name)
          name, week AS last_week, value AS last_val
   FROM   tbl
   ORDER  BY name, week DESC
   ) l USING (name);

Or shorter:

SELECT *
FROM  (SELECT DISTINCT ON (1) name, week AS first_week, value AS first_val FROM tbl ORDER BY 1,2) f
JOIN  (SELECT DISTINCT ON (1) name, week AS last_week , value AS last_val  FROM tbl ORDER BY 1,2 DESC) l USING (name);

Simple and easy to understand. Also fastest in my old tests. Detailed explanation for DISTINCT ON:

2x window function, 1x DISTINCT ON

SELECT DISTINCT ON (name)
       name, week AS first_week, value AS first_val
     , first_value(week)  OVER w AS last_week
     , first_value(value) OVER w AS last_value
FROM   tbl t
WINDOW w AS (PARTITION BY name ORDER BY week DESC)
ORDER  BY name, week;

The explicit WINDOW clause only shortens the code, no effect on performance.

first_value() of composite type

The aggregate functions min() or max() do not accept composite types as input. You would have to create custom aggregate functions (which is not that hard).
But the window functions first_value() and last_value() do. Building on that we can devise simple solutions:

Simple query

SELECT DISTINCT ON (name)
       name, week AS first_week, value AS first_value
     ,(first_value((week, value)) OVER (PARTITION BY name ORDER BY week DESC))::text AS l
FROM   tbl t
ORDER  BY name, week;

The output has all data, but the values for the last week are stuffed into an anonymous record (optionally cast to text). You may need decomposed values.

Decomposed result with opportunistic use of table type

For that we need a well-known composite type. An adapted table definition would allow for the opportunistic use of the table type itself directly:

CREATE TABLE tbl (week int, value int, name text);  -- optimized column order

week and value come first, so now we can sort by the table type itself:

SELECT (l).name, first_week, first_val
     , (l).week AS last_week, (l).value AS last_val
FROM  (
   SELECT DISTINCT ON (name)
          week AS first_week, value AS first_val
        , first_value(t) OVER (PARTITION BY name ORDER BY week DESC) AS l
   FROM   tbl t
   ORDER  BY name, week
   ) sub;

Decomposed result from user-defined row type

That's probably not possible in most cases. Register a composite type with CREATE TYPE (permanent) or with CREATE TEMP TABLE (for the duration of the session):

CREATE TEMP TABLE nv(last_week int, last_val int);  -- register composite type
SELECT name, first_week, first_val, (l).last_week, (l).last_val
FROM (
   SELECT DISTINCT ON (name)
          name, week AS first_week, value AS first_val
        , first_value((week, value)::nv) OVER (PARTITION BY name ORDER BY week DESC) AS l
   FROM   tbl t
   ORDER  BY name, week
   ) sub;

Custom aggregate functions first() & last()

Create functions and aggregates once per database:

CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';

CREATE AGGREGATE public.first(anyelement) (
  SFUNC = public.first_agg
, STYPE = anyelement
, PARALLEL = safe
);


CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $2';

CREATE AGGREGATE public.last(anyelement) (
  SFUNC = public.last_agg
, STYPE = anyelement
, PARALLEL = safe
);

Then:

SELECT name
     , first(week) AS first_week, first(value) AS first_val
     , last(week)  AS last_week , last(value)  AS last_val
FROM  (SELECT * FROM tbl ORDER BY name, week) t
GROUP  BY name;

Probably the most elegant solution. Faster with the additional module first_last_agg providing a C implementation.
Compare instructions in the Postgres Wiki.

Related:



db<>fiddle here (showing all)
Old sqlfiddle

Each of these queries was substantially faster than the currently accepted answer in a quick test on a table with 50k rows with EXPLAIN ANALYZE.

There are more ways. Depending on data distribution, different query styles may be (much) faster, yet. See:

kworr
  • 3,579
  • 1
  • 21
  • 33
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Asked it above, but repeat it here for this particular answer: Do you think this deserves some particular indexes for this to work better? I have an index by what would be here 'name' and 'week' (date in my case), but the query is taking ages in a 60M rows table. Perhaps a compound index by name and date, in that order? – Fermin Silva Jun 27 '16 at 16:15
  • 1
    @FerminSilva: Consider this related answer discussing performance of "greatest n per group" problems in depth: ["Optimize GROUP BY query to retrieve latest record per user"](http://stackoverflow.com/a/25536748/939860) - including ways to optimize and indexes. – Erwin Brandstetter Jun 28 '16 at 14:36
  • The "*2x window function*" seems most elegant. Is there any reason not to use it? What did you mean by "*no effect on performance*"? – Bergi Oct 09 '20 at 23:25
  • 2
    @Bergi: An explicit `WINDOW` clause has "no effect on performance" when compared to spelling out the window clause with `OVER (...)` per window function (repeatedly). "2x window function" is elegant, but it's more expensive to first compute window functions for every row, only to eliminate duplicates per group in the next step. Other variants are faster. – Erwin Brandstetter Oct 13 '20 at 13:04
  • @ErwinBrandstetter A pity it can't do that together :-/ I guess this would need `first_value` as an aggregate function, right? (Like in your linked answer) Is there a motion to add this as a native feature? – Bergi Oct 13 '20 at 13:59
9

This is a bit of a pain, because Postgres has the nice window functions first_value() and last_value(), but these are not aggregation functions. So, here is one way:

select t.name, min(t.week) as minWeek, max(firstvalue) as firstvalue,
       max(t.week) as maxWeek, max(lastvalue) as lastValue
from (select t.*, first_value(value) over (partition by name order by week) as firstvalue,
             last_value(value) over (partition by name order by week) as lastvalue
      from table t
     ) t
group by t.name;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • It works, but it is not elegant, not friendly... And perhaps lost performance (not need `max()` comparisons). Why PostgreSQL not use (or its community not like to) the first/last as build-in aggregate functions?? There are [external lib for fast first/last](http://pgxn.org/dist/first_last_agg/), a problem with it? – Peter Krauss Mar 25 '16 at 16:31
  • Do you think this deserves some particular indexes for this to work better? I have an index by what would be here 'name' and 'week' (date in my case), but the query is taking ages in a 60M rows table. Perhaps a compound index by name and date? – Fermin Silva Jun 27 '16 at 16:12
  • An index on `(name, week, value)` might help the query. – Gordon Linoff Jun 28 '16 at 03:03
  • NOTE: this doesn't necessarily work with all window functions (ie sum) where the MAX might note be the final record when some/all values are negative. in order to fix it you'd need to use `row_number()` with the partition and then use another window get just the last row (highest row_number). the 2x Distinct solution from erwin is better imo. – pstanton Feb 28 '17 at 06:51
  • @pstanton . . . I simply do not understand your comment. – Gordon Linoff Feb 28 '17 at 13:29
  • @GordonLinoff firstly, great answer but doesn't work for all window functions - in the example you give it is fine. but if you wanted to use `sum` alongside `last_value` and `first_value`. doing the `max` on `sum` would not necessarily give you the final sum (ie the last sum value). you would need another window function to get that. – pstanton Feb 28 '17 at 21:40