Use something like TOP with GROUP BY

Question

With table table1 like below

+--------+-------+-------+------------+-------+
| flight |  orig |  dest |  passenger |  bags |
+--------+-------+-------+------------+-------+
|   1111 |  sfo  |  chi  |  david     |     3 |
|   1112 |  sfo  |  dal  |  david     |     7 |
|   1112 |  sfo  |  dal  |  kim       |     10|
|   1113 |  lax  |  san  |  ameera    |     5 |
|   1114 |  lax  |  lfr  |  tim       |     6 |
|   1114 |  lax  |  lfr  |  jake      |     8 |
+--------+-------+-------+------------+-------+

I'm aggregating the table by orig like below

select 
  orig
  , count(*) flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig

I need to add the passenger with the longest name ( length(passenger) ) for each orig group - how do I go about it?

Output expected

+------+-------------+-----------+---------------+-------------------+
| orig |  flight_cnt |  pass_cnt |  bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo  |           3 |         2 |             7 |  david            |
| lax  |           3 |         3 |             6 | ameera            |
+------+-------------+-----------+---------------+-------------------+

post the output how you want...question is unclear – mohan111 May 12 '17 at 09:56 — mohan111, May 12 '17 at 09:56
and if both names are same length? – Vao Tsun May 12 '17 at 10:13 — Vao Tsun, May 12 '17 at 10:13
BTW, it's `length()` or `char_length()`, not `len()` – Erwin Brandstetter May 12 '17 at 12:24 — Erwin Brandstetter, May 12 '17 at 12:24

score 5 · Accepted Answer · edited May 23 '17 at 12:02

You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.

Select first row in each GROUP BY group?

But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest to join two separate subqueries:

SELECT *
FROM  (  -- your original query
   SELECT orig
        , count(*) AS flight_cnt
        , count(distinct passenger) AS pass_cnt
        , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
   FROM   table1
   GROUP  BY orig
   ) org_query
JOIN  (  -- my addition
   SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
   FROM   table1
   ORDER  BY orig, length(passenger) DESC NULLS LAST
   ) pas USING (orig);

USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.

If passenger can be NULL, it is important to add NULLS LAST:

PostgreSQL sort by datetime asc, null first?

From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick - unless you add more expressions to ORDER BY as tiebreaker. Detailed explanation in the answer linked above.

Performance?

Typically, a single scan is superior, especially with sequential scans.

The above query uses two scans (maybe index / index-only scans). But the second scan is comparatively cheap unless the table is too huge to fit in cache (mostly). Lukas suggested an alternative query with only a single SELECT adding:

, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1]  -- I'd add NULLS LAST

The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)

The same approach can be cheaper with a custom aggregate function first() like instructed in the Postgres Wiki here. Or, faster, yet, with a version written in C, available on PGXN. Eliminates the extra cost for array handling, but we still need per-group ORDER BY. May be faster for only few groups. You would then add:

 , first(passenger ORDER BY length(passenger) DESC NULLS LAST)

Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first - catch 22. Gordon solves this with a subquery - another candidate for good performance with standard Postgres.

first() does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values. Details:

Optimize GROUP BY query to retrieve latest record per user

The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.

Hmm, that means two table accesses, when it is possible with only one... — Lukas Eder, May 12 '17 at 12:57
@LukasEder: You are right, basically. A single scan would be lovely. But last time I tested, `array_agg` with `ORDER BY` did not perform so well. A subquery with `first_value()` has better chances, but there are faster alternatives ... I added a chapter discussing options. — Erwin Brandstetter, May 13 '17 at 00:32
Thanks for your insight. Very interesting. `first()` might be a good candidate to add to PostgreSQL as an out-of-the-box feature. Could be similarly useful as Oracle's `KEEP` clause. — Lukas Eder, May 13 '17 at 16:12
@LukasEder: Yeah, `first()` would be handy in standard Postgres. It might be a trap for naive users falsely assuming a natural order of rows who forget to add `ORDER BY`. — Erwin Brandstetter, May 13 '17 at 16:34
In Oracle's `KEEP` clause, `ORDER BY` is a syntactic requirement. Sure, you could still implement random ordering by using something like `ORDER BY 1` or `ORDER BY (SELECT 1)`, but I guess that's OK — Lukas Eder, May 14 '17 at 08:02

score 2 · Answer 2 · answered May 12 '17 at 10:32

One method uses the window function first_value(). Unfortunately, this is not available as an aggregation function:

select orig,
       count(*) flight_cnt,
       count(distinct passenger) as pass_cnt,
       percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med,
       max(longest_name) as longest_name
from (select t1.*,
             first_value(name) over (partition by orig order by length(name) desc) as longest_name
      from table1
     ) t1
group by orig;

score 1 · Answer 3 · answered May 12 '17 at 10:21

You are looking for something like Oracle's KEEP FIRST/LAST where you get a value (the passenger name) according to an aggregate (the name length). PostgreSQL doesn't have such function as far as I know.

One way to go about this is a trick: Combine length and name, get the maximum, then extract the name: '0005david' > '0003kim' etc.

select 
  orig
  , count(*) flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med,
  , substr(max(to_char(char_length(passenger), '0000') || passenger), 5) as name
from table1
group by orig
order by orig;

score 1 · Answer 4 · edited May 23 '17 at 12:34

For small group sizes, you could use array_agg()

SELECT
  orig
  , COUNT (*) AS flight_cnt
  , COUNT (DISTINCT passenger) AS pass_cnt
  , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY bags ASC) AS bag_cnt_med
  , (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] AS pass_max_len_name
FROM table1
GROUP BY orig

Having said so, while this is shorter syntax, a first_value() window function based approach might be faster for larger data sets as array accumulation might become expensive.

Vao Tsun · Answer 5 · 2017-05-12T10:30:24.590

bot it does not solve problem if you have several names wqith same length:

t=# with p as (select distinct orig,passenger,length(trim(passenger)),max(length(trim(passenger))) over (partition by orig) from s127)
, o as (    select
      orig
      , count(*) flight_cnt
      , count(distinct passenger) as pass_cnt
      , percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
    from s127
    group by orig)
select distinct o.*,p.passenger from o join p on p.orig = o.orig where max=length;
  orig   | flight_cnt | pass_cnt | bag_cnt_med |  passenger
---------+------------+----------+-------------+--------------
   lax   |          3 |        3 |           6 |   ameera
   sfo   |          3 |        2 |           7 |   david
(2 rows)

populate:

t=# create table s127(flight int,orig text,dest text, passenger text, bags int);
CREATE TABLE
Time: 52.678 ms
t=# copy s127 from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>    1111 |  sfo  |  chi  |  david     |     3
>>    1112 |  sfo  |  dal  |  david     |     7
   1112 |  sfo  |  dal  |  kim       |     10
   1113 |  lax  |  san  |  ameera    |     5
   1114 |  lax  |  lfr  |  tim       |     6
   1114 |  lax  |  lfr  |  jake      |     8 >> >> >> >>
>> \.
COPY 6

expected output is different, updated question with expected output. — user3206440, May 12 '17 at 10:06

Use something like TOP with GROUP BY

5 Answers5

Performance?

Linked