SQL aggregate function alias

Question

I'm a beginner at SQL and this is the question I have been asked to solve:

Say that a big city is defined as a place of type city with a population of at least 100,000. Write an SQL query that returns the scheme (state_name,no_big_city,big_city_population) ordered by state_name, listing those states which have either (a) at least five big cities or (b) at least one million people living in big cities. The column state_name is the name of the state, no_big_city is the number of big cities in the state, and big_city_population is the number of people living in big cities in the state.

Now, as far as I can see, the following query returns correct results:

SELECT state.name AS state_name
     , COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
     , SUM(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM state
JOIN place
ON state.code = place.state_code
GROUP BY state_name
    HAVING
        COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5 OR
        SUM(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;

However, the two aggregate functions used in the code appear twice. MY question: is there any way of making this code duplication disappear preserving functionality?

To be clear, I have already tried using the alias, but I just get a "column does not exist" error.

Unfortunately I tried this and I just get an error saying that the first column referenced, `no_big_city`, does not exist... — Paul Benn, Nov 10 '16 at 21:06
@GordonLinoff: The `HAVING` clause only accepts *input* column names. I added the relevant quote from the manual below. — Erwin Brandstetter, Nov 10 '16 at 23:14

score 8 · Accepted Answer · edited May 23 '17 at 12:08

The manual clarifies:

An output column's name can be used to refer to the column's value in ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses; there you must write out the expression instead.

Bold emphasis mine.

You can avoid typing long expressions repeatedly with a subquery or CTE:

SELECT state_name, no_big_city, big_city_population
FROM  (
   SELECT s.name AS state_name
        , COUNT(*)        FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS no_big_city
        , SUM(population) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS big_city_population
   FROM   state s
   JOIN   place p ON s.code = p.state_code
   GROUP  BY s.name -- can be input column name as well, best schema-qualified to avoid ambiguity
   ) sub
WHERE  no_big_city >= 5
   OR  big_city_population >= 1000000
ORDER  BY state_name;

While being at it, I simplified with the aggregate FILTER clause (Postgres 9.4+):

How can I simplify this game statistics query?

However, I suggest this simpler and faster query to begin with:

SELECT s.state_name, p.no_big_city, p.big_city_population
FROM   state s
JOIN  (
   SELECT state_code      AS code  -- alias just to simplify join
        , count(*)        AS no_big_city
        , sum(population) AS big_city_population
   FROM   place
   WHERE  type = 'city'
   AND    population >= 100000
   GROUP  BY 1  -- can be ordinal number referencing position in SELECT list
   HAVING count(*) >= 5 OR sum(population) >= 1000000  -- simple expressions now
   ) p USING (code)
ORDER  BY 1;    -- can also be ordinal number

I am demonstrating another option to reference expressions in GROUP BY and ORDER BY. Only use that if it doesn't impair readability and maintainability.

score 1 · Answer 2 · answered Nov 10 '16 at 21:27

Not sure if this is a comment or an answer, since it is more preference based as opposed to technical, but I'll post it anyway

What I usually do when I need to reference calculated columns (usually a LOT at the same time) is I put my calculated columns within a derived table and then reference the calculated columns using its alias outside of the derived table. This syntax should be ANSI-SQL correct, but I am not familiar with PostGRES

select * from (

SELECT STATE.NAME AS state_name
    ,COUNT(CASE WHEN place.type = 'city'
                AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
    ,SUM(CASE WHEN place.type = 'city'
                AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM STATE
INNER JOIN place
    ON STATE.code = place.state_code
    GROUP BY state_name
) sub 
    where no_big_city >= 5 
        and big_city_population >=100000

--HAVING COUNT(CASE WHEN place.type = 'city'
--          AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5
--  OR SUM(CASE WHEN place.type = 'city'
--              AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;

The nice thing about this approach is, although you are adding complication via a subquery/derived table, the formula is kept in one place, so any changes only have to happen once. I do not know if this will perform worse than simply repeating the calcuation in the group-by, but I can't imagine it would be that much worse.

Vadim · Answer 3 · 2016-11-10T21:49:04.337

SELECT clause is what you want to select from the filtred by WHERE clause table(s).
GROUP BY is a condition how to group filtered records to use in aggregation functions in the SELECT. So alias cannot be there. But you can wrap your filtered records and select from them. Something like that:

SELECT state_name, no_big_city, big_city_population 
FROM
 (
   SELECT 
     state.name AS state_name,     
     COUNT(1) no_big_city,
     MAX(place.population) max_city_population,
     SUM(place.population) AS big_city_population
   FROM state JOIN place ON state.code = place.state_code 
   WHERE   
     place.type = 'city' AND
     place.population >= 100000
   GROUP BY  state.name
  )
WHERE 
   no_big_city >= 5 OR
   max_city_population > 1000000
ORDER BY state_name

Also, moving conditions

   place.type = 'city' AND
   place.population >= 100000

out of CASE to WHERE will perform better. "No city" or "small city records will not be processed. especially if there is an index on place.type column.

SQL aggregate function alias

3 Answers3

Linked