
Is there a proper way to aggregate a single column when I have many other columns in the query?

I've tried this answer, which works, but it makes my query a lot more verbose.

My current query looks like this:

SELECT t1.foo1, t1.foo2, t2.foo3, t2.foo4, string_agg(t3.aggregated_field, ', ')
FROM tbl1 t1
LEFT JOIN tbl2 t2 ON t1.id = t2.fkeyid
LEFT JOIN tbl3 t3 ON t2.id = t3.fkeyid
GROUP BY t1.foo1, t1.foo2, t2.foo3, t2.foo4, t2.foo5, t2.foo6
ORDER BY t2.foo5, t2.foo6

The real query has many more fields and LEFT JOINs. The important part is that all of these fields have a 1-to-1 or 1-to-0 relationship, except for one 1-to-n field that I want to aggregate, represented by t3.aggregated_field in the pseudo-query above.

Since I'm using an aggregate function, every field listed in the SELECT and ORDER BY must either be aggregated or appear in the GROUP BY clause. This makes an already long query even more verbose.

That is, since foo1 is a primary key, whenever this field is repeated, all the other fields except aggregated_field are equal as well. I want these repeated rows collapsed into a single result row carrying the aggregated field's values (basically a SELECT DISTINCT with an aggregated column).

Is there a better way to do this (without having to put all the other fields in the GROUP BY), or should I just iterate over the result set in my back-end, executing an extra query per row to fetch this 1-to-n relationship?


The server is running PostgreSQL 9.1.9, more specifically:

PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54), 64-bit

Fabrício Matté
    Why would a developer *not* disclose the version of the software he is asking about. Why? It's a painful mass phenomenon here on SO. It's like otherwise highly intelligent people turn into dummies immediately once they ask a question. Supply your software version with your question. **This should go without saying.** – Erwin Brandstetter Apr 15 '13 at 19:35
  • @ErwinBrandstetter My bad, the version is 9.0+, I'll connect to server's network so I can check the exact version before I add to the question. – Fabrício Matté Apr 15 '13 at 19:40
  • Thanks. My comment is the result of frustration that has been building. It should be so obvious. And yet, so many people don't think of it. Even high-reputation folks. BTW, 9.1 - you go lucky there. My answer should work for you. – Erwin Brandstetter Apr 15 '13 at 19:52
  • @ErwinBrandstetter Yes, I understand. Though the use of `string_agg` implicitly makes it PostgreSQL 9+, I should know better — minor versions can bring significant changes — and I admit it was laziness on my part not to check the version beforehand. Skimming through, the answer makes perfect sense; I'll provide feedback in an hour or so when I have the time to apply and test it. `=]` – Fabrício Matté Apr 15 '13 at 19:55
  • BTW, major versions include first digit after the dot in PostgreSQL. [More on the official versioning site of the project.](http://www.postgresql.org/support/versioning/) – Erwin Brandstetter Apr 15 '13 at 20:06
  • @ErwinBrandstetter damn my stupidity. Thanks for the info. – Fabrício Matté Apr 15 '13 at 20:23

2 Answers


Simple query

This can be much simpler with PostgreSQL 9.1 or later. As explained in this closely related answer:

It is enough to GROUP BY the primary key of a table. Since:

foo1 is a primary key

.. you can simplify your example to:

SELECT foo1, foo2, foo3, foo4, foo5, foo6, string_agg(aggregated_field, ', ')
FROM   tbl1
GROUP  BY 1
ORDER  BY foo7, foo8;  -- have to be spelled out, since not in select list!

Query with multiple tables

However, since you have:

many more fields and LEFT JOINs, the important part is that all these fields have 1 to 1 or 1 to 0 relationship except one field that is 1 to n which I want to aggregate

.. it should be faster and simpler to aggregate first, join later:

SELECT t1.foo1, t1.foo2, ...
     , t2.bar1, t2.bar2, ...
     , a.aggregated_col 
FROM   tbl1 t1
LEFT   JOIN tbl2 t2 ON ...
...
LEFT   JOIN (
   SELECT some_id, string_agg(agg_col, ', ') AS aggregated_col
   FROM   agg_tbl
   GROUP  BY some_id
   ) a ON a.some_id = ?.some_id
ORDER  BY ...

This way the big portion of your query does not need aggregation at all.
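The "aggregate first, join later" pattern can be sketched end to end. This is a minimal, runnable example using Python's built-in sqlite3 module, with SQLite's `group_concat()` standing in for PostgreSQL's `string_agg()`; the table and column names mirror the pseudo-query and are assumptions, not the asker's real schema:

```python
import sqlite3

# Sketch of "aggregate first, join later": the 1-to-n table is collapsed
# to one row per foreign key inside a subquery, then LEFT JOINed, so the
# outer query needs no GROUP BY at all.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl1 (id INTEGER PRIMARY KEY, foo1 TEXT);
CREATE TABLE tbl2 (id INTEGER PRIMARY KEY, fkeyid INTEGER, foo3 TEXT);
CREATE TABLE tbl3 (id INTEGER PRIMARY KEY, fkeyid INTEGER, agg_col TEXT);
INSERT INTO tbl1 VALUES (1, 'a'), (2, 'b');
INSERT INTO tbl2 VALUES (1, 1, 'x'), (2, 2, 'y');
INSERT INTO tbl3 VALUES (1, 1, 'p'), (2, 1, 'q');  -- two rows for tbl2.id = 1
""")
rows = con.execute("""
SELECT t1.foo1, t2.foo3, a.aggregated_col
FROM   tbl1 t1
LEFT   JOIN tbl2 t2 ON t1.id = t2.fkeyid
LEFT   JOIN (
   SELECT fkeyid, group_concat(agg_col, ', ') AS aggregated_col
   FROM   tbl3
   GROUP  BY fkeyid
   ) a ON a.fkeyid = t2.id
ORDER  BY t1.foo1
""").fetchall()
print(rows)
```

Note how the row with no matching tbl3 entries still comes back (with NULL in the aggregated column), which is exactly the LEFT JOIN behavior the original query relies on.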

I recently provided a test case in an SQL Fiddle to prove the point in this related answer.

Since you are referring to this related answer: No, DISTINCT is not going to help at all in this case.

Erwin Brandstetter
  • Yes, I noticed the `DISTINCT` wouldn't help in this case a couple hours ago. I'll check your answer throughout when I get home. `=]` – Fabrício Matté Apr 15 '13 at 19:42
  • From my understanding, the sub-query will implicitly create a temporary table with the aggregations of the whole table before the join. In that case I can optimize it a bit if I put a `WHERE` clause in the sub-query, correct? This looks like the best approach, I'll adapt it to my needs. Thanks. – Fabrício Matté Apr 15 '13 at 20:30
  • @FabrícioMatté: A WHERE clause may be useful, especially if you have an index that corresponds to it. However, depending on the whole query, the Postgres query planner may use varying plans, whichever it expects to be fastest (that's where properly configured [planner cost constants](http://www.postgresql.org/docs/current/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS) come in). It's not necessarily a "temp table" (materialize step). Test with `EXPLAIN ANALYZE` to get the details. – Erwin Brandstetter Apr 15 '13 at 20:36

If the main problem is that the fields (fooN) are computed expressions, then this can help:

SELECT foo1, foo2, foo3, foo4, foo5, foo6, string_agg(aggregated_field, ', ')
FROM tbl1
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY 5, 6

The ordinals 1, 2, ... refer to the fields by their position in the select list.
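A quick way to see the ordinals in action, again with sqlite3 as a stand-in (PostgreSQL accepts the same `GROUP BY 1` / `ORDER BY 1` syntax); the computed expression `upper(foo1)` is a made-up example:

```python
import sqlite3

# Ordinal positions in GROUP BY / ORDER BY refer to the select list,
# so a computed expression does not have to be repeated.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl1 (foo1 TEXT, val TEXT)")
con.executemany("INSERT INTO tbl1 VALUES (?, ?)",
                [('a', 'x'), ('a', 'y'), ('b', 'z')])
rows = con.execute("""
SELECT upper(foo1), group_concat(val, ', ')
FROM   tbl1
GROUP  BY 1   -- 1 = upper(foo1), not spelled out again
ORDER  BY 1
""").fetchall()
print(rows)
```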

Clodoaldo Neto