Create two arrays for two fields, keeping sort order of arrays in sync (without subquery)

Question

There is no rhyme or reason for this question other than I was curious about how one would go about doing this.

Platform: while I was hoping for a SQL-Standard solution, my main concentration is with PostgreSQL 8.4+. (I know 9.0+ has some array sorting functions.)

SELECT    id, group, dt
FROM      foo
ORDER BY  id;

  id   | group |    dt
-------+-------+-----------
   1   |  foo  | 2012-01-01
   1   |  bar  | 2012-01-03
   1   |  baz  | 2012-01-02
   2   |  foo  | 2012-01-01
   3   |  bar  | 2012-01-01
   4   |  bar  | 2012-01-01
   4   |  baz  | 2012-01-01

I know the following query is wrong, but the result is similar to what I'm after; a way to tie the two fields (sorting of group should also sort dt):

SELECT    id, sort_array(array_agg(group)), array_agg(dt)
FROM      foo
GROUP BY  id;

  id   |     group      |                dt
-------+----------------+------------------------------------
   1   |  {bar,baz,foo} | {2012-01-03,2012-01-02,2012-01-01}
   2   |  {foo}         | {2012-01-01}
   3   |  {bar}         | {2012-01-01}
   4   |  {bar,baz}     | {2012-01-01,2012-01-01}

Is there an easy way to tie the fields for sorting, w/o using a subquery? Perhaps build an array of arrays and then unnest?

I welcome downvotes, but wished I knew why to improve for next time. — vol7ron, Mar 19 '12 at 17:45
+1 I upvoted, it is a decent question. The program version is missing, though (that's *always* a must.) And the rationale behind the request. — Erwin Brandstetter, Mar 19 '12 at 19:10
I don't mind being criticized. All for self-improvement. Updated the question, cheers — vol7ron, Mar 19 '12 at 19:46

score 4 · Accepted Answer · edited May 23 '17 at 12:24

I changed your column name group to grp because group is a reserved word in Postgres and every SQL standard and shouldn't be used as identifier.

I understand your question like this:

Get the two arrays sorted in identical sort order so that the same element position corresponds to the same row in both arrays.

Use a subquery or CTE and order the rows before you aggregate.

SELECT id, array_agg(grp) AS grp, array_agg(dt) AS dt
FROM  (
    SELECT *
    FROM   tbl
    ORDER  BY id, grp, dt
    ) x
GROUP  BY id;

That's faster than to use individual ORDER BY clauses in the aggregate function array_agg() like @Mosty demonstrates (and which has been there since PostgreSQL 9.0). Mosty also interprets your question differently and uses the proper tools for his interpretation.

Is `ORDER BY` in a subquery safe?

The manual:

The aggregate functions array_agg, json_agg, [...] as well as similar user-defined aggregate functions, produce meaningfully different result values depending on the order of the input values. This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7. Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
Beware that this approach can fail if the outer query level contains additional processing, such as a join, because that might cause the subquery's output to be reordered before the aggregate is computed.

So yes, it's safe in the example.

Without subquery

If you really need a solution without subquery, you can:

SELECT id
     , array_agg(grp ORDER BY grp)
     , array_agg(dt  ORDER BY grp, dt)
FROM   tbl
GROUP  BY id;

Note the ORDER BY grp, dt. I sort by dt in addition to break ties and make the sort order unambiguous. Not necessary for grp, though.

There is also a completely different way to do this, with window functions:

SELECT DISTINCT ON (id)
       id
     , array_agg(grp) OVER w AS grp
     , array_agg(dt)  OVER w AS dt
FROM   tbl
WINDOW w AS (PARTITION BY id ORDER BY grp, dt
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY id;

Note the DISTINCT ON (id) instead of just DISTINCT which produces the same result but performs faster by an order of magnitude because we do not need an extra sort.

I ran some tests and this is almost as fast as the other two solutions. As expected, the subquery version was still fastest. Test with EXPLAIN ANALYZE to see for yourself.

I knew someone was going to comment on the group reserved word :) I thought foo would be enough to show it's only an example. But I was more inquiring about a way to sort w/o the use of an explicit subquery. — vol7ron, Mar 19 '12 at 17:28

score 1 · Answer 2 · edited Mar 19 '12 at 20:45

1

The only way that I know to flatten this out is to use a recursive CTE. Here is what the query would look like:

--We first need to create the order hierarchy to recurse properly
SELECT *, ROW_NUMBER() OVER(PARTITION BY id ORDER BY group) AS rownum
INTO TEMP TableToRecurse
FROM foo

WITH RECURSIVE FinalOutput (id, group, dt, rownum) AS
(
--Anchor row of recursion
SELECT id, group, dt, rownum
FROM TableToRecurse 
WHERE rownum = 1
UNION ALL 
--Recursion piece
SELECT tr.id, FinalOutput.group || ', ' || tr.group, 
    FinalOutput.dt || ', ' || tr.dt, tr.rownum
FROM TableToRecurse AS tr
    JOIN FinalOutput
        ON  FinalOutput.id = tr.id AND FinalOutput.rownum = tr.rownum +1
)
--Final output only showing the last row (Max)
--Which should have everything concatenated together
SELECT FinalOutput.id, FinalOutput.group, FinalOutput.dt
FROM FinalOutput 
    JOIN 
    (
        SELECT MAX(rownum) AS MaxRowNum, id
        FROM FinalOutput
        GROUP BY id
    ) AS MaxForEach
        ON FinalOutput.id = MaxForEach.id 
            AND FinalOutput.rownum = MaxForEach.MaxRowNum

edited Mar 19 '12 at 20:45

Erwin Brandstetter

605,456
145
1,078
1,228

answered Mar 19 '12 at 15:23

Justin Pihony

66,056
18
147
180

Pretty interesting - I'll have to look at this in more detail. – vol7ron Mar 19 '12 at 15:36
This is not valid PostgreSQL syntax. Looks like it is for SQL Server? – Erwin Brandstetter Mar 19 '12 at 16:15
Yes, my background is SQL Server, but I did double check that this could be done in PostGreSQL (see my link). I noticed that the temp table syntax is different, but I believe that should be fixed now. Otherwise, what is wrong with this? – Justin Pihony Mar 19 '12 at 16:25
Yeah, I saw when there were hash marks it wasn't Postgres, but I think the Recursive structure between the two databases are the same. – vol7ron Mar 19 '12 at 19:43
1

@JustinPihony: The string concatenation operator in Postgres is `||` - which is also the SQL standard. `+` is not defined for strings, that's a Microsoft thing. You also still have the `#` in front of the temp table name. You might be interested in the site `sqlfiddle.com` to test the your syntax. The basic structure of the recursive CTE works, though. – Erwin Brandstetter Mar 19 '12 at 20:13

score 1 · Answer 3 · answered Mar 19 '12 at 15:39

1

Try this:

select id,
  array_agg(agroup order by agroup) as AGroup,
  array_agg(dt order by dt desc) as dt
from t
group by id

This should work on PostgreSQL 9.1+

answered Mar 19 '12 at 15:39

Mosty Mostacho

42,742
16
96
123

I think what I was after would be more `array_agg(dt order by agroup) as dt`, I don't have 9.1, but +1 anyways as I'm sure I'll have it in the future – vol7ron Mar 19 '12 at 19:44

Create two arrays for two fields, keeping sort order of arrays in sync (without subquery)

3 Answers3

Is `ORDER BY` in a subquery safe?

Without subquery

Linked

Create two arrays for two fields, keeping sort order of arrays in sync (without subquery)

3 Answers3

Is ORDER BY in a subquery safe?

Without subquery

Linked

Is `ORDER BY` in a subquery safe?