
In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the result table. One way to do this is to GROUP BY a CASE expression that returns a single value, e.g., NULL, for the unimportant groups, coalescing them into one row. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and is close to PostgreSQL 8.0 in terms of functionality.

As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to aggregate such that one row is emitted for every (type, url) pair whose visit count meets a certain threshold, and one (type, NULL) row is emitted for all (type, url) pairs whose visit counts fall below that threshold. The remaining columns in the result table would carry SUM/COUNT aggregates based on this grouping.

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
|  A   | http://popular.com   |                       |
|  A   | http://popular.com   |                       |
|  A   | < 9997 more times>   |                       |
|  A   | http://popular.com   |                       |
|  A   | http://small-one.com |                       |
|  B   | http://tiny.com      |                       |
|  B   | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+

should produce the following result table with a threshold of 10,000

+------+----------------------+-------------+--------------------------+
| type | url                  | visit_count | < SUM/COUNT aggregates > |
+------+----------------------+-------------+--------------------------+
|  A   | http://popular.com   |       10000 |                          |
|  A   |                      |           1 |                          |
|  B   |                      |           2 |                          |
+------+----------------------+-------------+--------------------------+
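
For the record, the natural single-pass formulation is the one Redshift rejects. A sketch, reconstructed from the truncated error message quoted in the comments on the accepted answer (the exact original formulation may have differed):

select type, (case when cnt >= 10000 then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url) t
group by type, url
-- ERROR: column "t.cnt" must appear in the GROUP BY clause
-- or be used in an aggregate function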

Summary:

Amazon Redshift has certain subquery correlation limitations one needs to tiptoe around. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, replicating the expression in both the result column and the outer GROUP BY clause:

WITH temp_counts AS (SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url)
SELECT type, (CASE WHEN cnt >= 10000 THEN url END) AS url, SUM(cnt) AS cnt
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END)
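
Against the example data, this yields exactly the three-row result shown above: (A, http://popular.com, 10000), (A, NULL, 1), and (B, NULL, 2).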

Further testing showed that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one per branch of the CASE expression. In this particular case, on a sample data set of approximately 200M rows, this approach consistently performed about 30% faster. That result is schema- and data-specific, however.

WITH temp_counts AS (SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url)
SELECT * FROM temp_counts WHERE cnt >= 10000
UNION ALL
SELECT type, NULL AS url, SUM(cnt) AS cnt FROM temp_counts
WHERE cnt < 10000
GROUP BY type
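
The first branch emits the over-threshold (type, url) rows unchanged; the second collapses all under-threshold rows into one row per type. Gordon Linoff attributes the speedup to the CTE being materialized once, so each branch only re-reads the precomputed counts (see the comments on his answer below).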

This suggests two general patterns for implementing and optimizing arbitrary disjoint grouping & summarization in Amazon Redshift. If performance is important to you, benchmark both.
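
One way to compare them is to check the plans and time several runs of each. A sketch (EXPLAIN works on Redshift; time repeated runs, since the first execution of a query may include compilation overhead):

-- Inspect the plan of the double-aggregation variant; repeat for the UNION ALL one.
EXPLAIN
WITH temp_counts AS (SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url)
SELECT type, (CASE WHEN cnt >= 10000 THEN url END) AS url, SUM(cnt) AS cnt
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END);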

Sim

2 Answers


You would do this with two aggregations:

select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url
     ) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;
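
The inner aggregation computes one row per (type, url) with its visit count; the outer aggregation re-groups those rows, with the CASE collapsing under-threshold URLs to NULL. Note that the CASE expression is repeated verbatim in the GROUP BY clause; as the comments below show, that repetition is what makes this formulation work on Redshift.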
Gordon Linoff
  • I played with this path at the beginning also but it generates the following error: An error occurred when executing the SQL command: select type, (case when cnt >=10000 then url end) as url, sum(cnt) as visit_cnt from (select type, url, ... ERROR: column "t.cnt" must appear in the GROUP BY clause or be used in an aggregate function [SQL State=42803] – Sim Dec 29 '13 at 00:47
  • I tracked down the error to this specific feature--references in a GROUP BY column to the results of a correlated subquery--not being supported by Amazon Redshift and updated the question with the info. – Sim Dec 29 '13 at 01:04
  • @Sim . . . Try the version with the expression in the `group by` (I just modified it). – Gordon Linoff Dec 29 '13 at 13:46
  • Hmm, if I externalize the nested SELECT to a WITH clause, the example above (with double aggregation) runs consistently about 30% slower than a UNION of two SELECTs: one WHERE cnt > XXX and the other of the form SELECT type, NULL as url, SUM(cnt) from temp_counts WHERE cnt < XXX GROUP BY 1. Does this strike you as an intuitive outcome? The final ordering should not matter much: the result has fewer than 10,000 rows. – Sim Dec 30 '13 at 01:47
  • What is `temp_counts`? You would need to look over the execution plans. The database knows more about "real" tables than subqueries, so it can better optimize code using them. – Gordon Linoff Dec 30 '13 at 02:03
  • In both cases I begin with WITH temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url) in order to make the comparison more even. Behind the scenes, Redshift sometimes creates an actual temporary table and sometimes does not (I can see that in the Redshift queries explorer). Regardless of that, the double GROUP BY runs consistently slower than a UNION of the two sides of the condition (cnt < XXX). – Sim Dec 30 '13 at 04:01
  • @Sim . . . Postgres materializes CTEs. So the aggregation is run only once. This probably significantly speeds up the `union` query. I would suggest doing a `union all` instead of `union` to be even faster. – Gordon Linoff Dec 30 '13 at 04:03
  • Yup, across a sample of several runs `UNION ALL` runs marginally faster. So, to summarize our findings, it looks like one possible way to optimize a `GROUP BY` on a `CASE` expression which collapses some rows may be to rewrite the query as a `UNION ALL` of each individual `WHEN` expression as an independent `WHERE` clause, with associated `GROUP BY`s to collapse rows where needed. If that makes sense then I will leave this as the accepted answer but will summarize our exchange in an update of the question. – Sim Dec 30 '13 at 04:27
  • @Sim . . . Your conclusion is accurate. Sometimes, it *is* possible to outdo the optimizer by creating more complicated queries. This is an example. – Gordon Linoff Dec 30 '13 at 13:58
  1. First, you group on type, url.
  2. Then you group a second time on type, `case when visit_count < 10000 then NULL else url end`.

I've used SQL Server syntax; I hope that works for Postgres, too.
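
A sketch of those two steps in SQL (a reconstruction from the description above; the 10,000 threshold and the table name t come from the question):

-- Step 1: count visits per (type, url).
-- Step 2: re-group, collapsing under-threshold URLs to NULL via the CASE.
SELECT type,
       CASE WHEN visit_count < 10000 THEN NULL ELSE url END AS url,
       SUM(visit_count) AS visit_count
FROM (SELECT type, url, COUNT(*) AS visit_count
      FROM t
      GROUP BY type, url) AS counted
GROUP BY type,
         CASE WHEN visit_count < 10000 THEN NULL ELSE url END;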

usr
  • Alas, that approach generates a SQL error. See my comments to Gordon's answer above. Looks like we need a different approach. – Sim Dec 29 '13 at 00:49
  • That is a strange error because Gordon's query does not use subqueries at all. It uses a derived table. There is no decorrelation necessary. Did you modify his query before running it? Maybe you added a subquery. – usr Dec 29 '13 at 10:55
  • Gordon's updated query works. Thanks for your answer and follow-up. – Sim Dec 30 '13 at 01:03