0

I have been working with window functions a fair amount but I don't think I understand enough about how they work to answer why they behave the way they do.

For the query that I was working on (below), why am I required to take my aggregated field and add it to the group by? (In the second half of my query below I am unable to produce a result if I don't include "Events" in my second group by)

With Data as (
Select 
  CohortDate as month
  ,datediff(week,CohortDate,EventDate)  as EventAge
  ,count(distinct case when EventDate is not null then GUID end) as Events
From MyTable
where month >= [getdate():month] - interval '12 months'
group by 1, 2
order by 1, 2
  )

Select 
  month
  ,EventAge
  ,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2 

I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?

Thanks a ton!

djdoerr
  • 3
  • 1
  • This was the most similar question I found, but it didn't seem to answer the question I was asking as I thought I had already included all fields I was grouping by as opposed to being required to add in the aggregated field as well! https://stackoverflow.com/questions/19601948/must-appear-in-the-group-by-clause-or-be-used-in-an-aggregate-function?rq=1 – djdoerr Mar 02 '20 at 18:37

1 Answers1

0

What you are looking for is presumably a cumulative sum. That would be:

select month, EventAge,
       sum(sum(Events)) over (partition by month
                              order by SubAge asc 
                              rows between unbounded preceding and current row
                             ) as TotEvents
from data
group by 1, 2
order by 1, 2 ;

Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:

select me.*
       sum(sum_events) over (partition by month
                             order by SubAge asc 
                             rows between unbounded preceding and current row
                            ) as TotEvents
from (select month, EventAge, sum(events) as sum_events
      from data
      group by 1, 2
     ) me
order by 1, 2 ;

This is pretty much an exactly shorthand for the query. The window function is evaluated after aggregation. You want to sum the SUM of the events after the aggregation. Hence, you need sum(sum(events)). After the aggregation, events is no longer available.

The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks @Gordon. The structure I had been using is the subquery version that you mentioned below. I am doing a sum([AggField1]) over (....) where [AggField1] = "Count(distinct ID) where condition is true". The biggest difference I see between what I was doing and what I see in your subquery version is that you aren't requesting a group by in your outermost query. When I updated my query to remove the group by 1,2, [sum_events] it works as expected. This was helpful in avoiding it. But still struggling to understand why it happens that way. Oh well, your way is better! – djdoerr Mar 02 '20 at 23:00
  • @djdoerr . . . Because window functions are evaluated *after* the aggregation. `events` doesn't exist after the aggregation (that you want), but `sum(events)` does. – Gordon Linoff Mar 02 '20 at 23:57