
I'm running into an interesting case where an inner query cannot access a truncated value used in the outer GROUP BY clause. How can I access a trunc-modified GROUP BY expression of the parent query from within a subquery?

Here's an abridged version:

SELECT
    date_trunc('week', mainEvent.timestamp::date + 1)::date - 1 AS weekly,
    (
        SELECT sum(p.value)
        FROM myschema.purchase AS p
        WHERE p.non_unique_id = mainEvent.non_unique_id
          AND date_trunc('week', p.timestamp::date + 1)::date - 1
            = date_trunc('week', mainEvent.timestamp::date + 1)
        GROUP BY (date_trunc('week', p.timestamp::date + 1)::date - 1)::date
    ) AS percent_of_week
FROM myschema.event AS mainEvent
WHERE mainEvent.internal_feed_name IS NOT NULL
GROUP BY weekly, mainEvent.non_unique_id;

This produces the error `subquery uses ungrouped column "mainevent.timestamp" from outer query` (Position: 1587).

I tried changing the outer GROUP BY clause to:

GROUP BY date_trunc('week', mainEvent.timestamp::date + 1)::date -1, mainEvent.non_unique_id

but it returns the same error. However, if I remove the trunc from the outer GROUP BY

GROUP BY mainEvent.timestamp

referencing the outer query's GROUP BY "works." This makes me think that this works in principle, but Postgres doesn't understand the mutated groupings. Maybe there's some syntactic sugar to make this work? Is there maybe a way to do this by creating a temp table instead? I also tried some variations of window functions (sum(...) OVER ...), but I still run into issues referencing the parent grouping column... I don't want to group by the raw timestamp; I want to group weekly.

Goal: I need to aggregate data following this pseudocode: (x / sum(x) in week) AS percent_of_week, i.e. each value divided by the sum over its weekly group.

So: some math divided by the sum of the group. In reality, the subquery has a numerator tied to each unique (omitted here for brevity), divided by that sum across the primary weekly group, which is why simply joining won't work. There are also no foreign keys in my schemas (raw analytics data).
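For illustration, a minimal sketch of that goal as a derived table plus a window function (column names borrowed from the query above; this assumes the numerator can be computed from purchase alone, which my real query may not allow):

-- Hedged sketch, not the failing query above:
-- aggregate per (week, id) first, then divide by the weekly total.
SELECT weekly,
       non_unique_id,
       x / sum(x) OVER (PARTITION BY weekly) AS percent_of_week
FROM  (
   SELECT date_trunc('week', timestamp::date + 1)::date - 1 AS weekly,
          non_unique_id,
          sum(value)::numeric AS x  -- cast to avoid integer division
   FROM   myschema.purchase
   GROUP  BY 1, 2
   ) sub;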

FlavorScape
  • Hi, interesting. Perhaps it would be easier with a CTE? https://www.postgresqltutorial.com/postgresql-cte/ – IronMan Mar 05 '21 at 20:20
  • A subquery with `GROUP BY` is almost never what one intends -- because it can return multiple rows and generate an error. Sample data, desired results, and a clear explanation of the logic would help. – Gordon Linoff Mar 05 '21 at 20:20
  • @GordonLinoff Well, if that were the case, I'd get a pre-run error `subquery returns multiple rows`. This isn't the case: when I use the non-truncated GROUP BY, I get the expected single-row result back. I use this pattern in other places, and subqueries can access the parent `GROUP BY`. – FlavorScape Mar 05 '21 at 20:23
  • What if you replace `date_trunc('week', mainEvent.timestamp::date + 1)` in the subquery with `weekly::date + 1`? Or just `weekly`, and make the other side of the comparison -2 instead of -1? That might allow the database to recognize that the value being compared depends on the `GROUP BY` column rather than the underlying column. – Fred Mar 05 '21 at 21:04
  • @Fred I tried that, however the alias `weekly` is not available to the inner query. – FlavorScape Mar 05 '21 at 21:15
  • Then I would use a CTE or derived table subquery to return (non_unique_ID, style_color, weekly) ungrouped, and reference that as mainEvent in your query above. – Fred Mar 05 '21 at 21:22
  • It would be instrumental to disclose a minimal workable table definition (`CREATE TABLE` statement) and your version of Postgres. – Erwin Brandstetter Mar 05 '21 at 22:27

1 Answer


Multiple problems. Consider LEFT JOIN LATERAL instead, like:

SELECT *
FROM  (
   SELECT non_unique_id
        , date_trunc('week', timestamp + interval '1 day')::date - 1 AS weekly
   FROM   myschema.event
   WHERE  internal_feed_name IS NOT NULL
   GROUP  BY 1, 2  -- positional references to the two SELECT expressions
   ) main_event
LEFT   JOIN LATERAL (
   SELECT sum(p.value) AS sum_value
   FROM   myschema.purchase p
   WHERE  p.non_unique_id = main_event.non_unique_id
   AND    p.timestamp >= main_event.weekly      -- "sargable" range covering the week
   AND    p.timestamp <  main_event.weekly + 7
   ) p ON true;

In your original query, the correlated subquery references the input column mainEvent.timestamp, which is ungrouped: the reference logically happens before the outer aggregation. That's the immediate reason for the error message you reported.

You cannot fix this by referencing the output column name weekly, as that is not visible from within the correlated subquery. You must rewrite the query: first aggregate, then join to table purchase.
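To see the rule in isolation, here is a minimal reproduction with a toy table (hypothetical, not from the question):

CREATE TABLE t (a int, b int);

SELECT a, (SELECT b)  -- correlated reference to ungrouped "t.b"
FROM   t
GROUP  BY a;
-- ERROR:  subquery uses ungrouped column "t.b" from outer query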

I suggest a cleaner LEFT JOIN LATERAL instead of the correlated subquery (though a correlated subquery would be possible now, on the next query level).

You don't need another GROUP BY in the subquery, as it's supposed to produce exactly one row in any case. Just remove it.
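To illustrate, this is a general property of SQL aggregates (table and columns from the question, queries only for demonstration):

-- An aggregate without GROUP BY yields exactly one row,
-- even with no matching rows (sum() then returns NULL):
SELECT sum(value) FROM myschema.purchase WHERE false;

-- With GROUP BY, it can yield zero or many rows instead:
SELECT sum(value) FROM myschema.purchase GROUP BY non_unique_id;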

I shortened the code by using positional references in GROUP BY instead of repeating the expressions from the SELECT list. That's optional.

And it's typically much more efficient to use a "sargable" expression like:

WHERE    p.timestamp >= main_event.weekly
AND      p.timestamp <  main_event.weekly + 7

That's assuming you want the week starting from weekly. Else shift both bounds accordingly.
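If the schema could be changed (the comments below note it can't), an index of roughly this shape would let the sargable predicates above be resolved with an index scan; the index name here is illustrative:

-- Hypothetical index; most useful if the table is big and the
-- selection per (id, week) is small:
CREATE INDEX purchase_id_ts_idx ON myschema.purchase (non_unique_id, timestamp);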


Erwin Brandstetter
  • Wow, this is great. Can you tell I'm an application engineer who normally doesn't do SQL, now doing analytics? lol. Well, I wish this were SARGable, but the timestamp isn't indexed and I can't change schemas because of the upstream system. Super helpful; searching for this kind of thing turns up so many unrelated results unless you already know what you're looking for. – FlavorScape Mar 07 '21 at 23:27
  • The way I've found this described that helps me understand what a LATERAL does: it's like a `for each` loop. – FlavorScape Mar 07 '21 at 23:28
  • @FlavorScape: An index on `purchase.timestamp` would go a long way for this query (if the table is big and the selection is small). But "sargable" predicates are typically faster than their ugly siblings even without an index. They don't require an expression to be evaluated for every row, just a simple comparison. – Erwin Brandstetter Mar 07 '21 at 23:35
  • The lateral join has saved my life! I was able to rewrite all of last week's work as cleaner queries in like 2 hours! Also, the data comes from a 3rd-party processor and they break if I make schema changes... I know from a CS standpoint how indices and partitioning would make this query so much faster! Wish I could... – FlavorScape Mar 08 '21 at 04:06