
I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions...


With the query below, I get the error for Col_C, Col_D, ... and almost everything else I tried:

SQL compilation error: [eachColumn] is not a valid group by expression

SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FIRST_VALUE(CASE WHEN Col_T = 'testvalue'
                THEN LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                                    ORDER BY Col_TimeStamp DESC 
                                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                ELSE NULL END) IGNORE NULLS 
                                    OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM mytable

So, is there a way to use nested window functions in Snowflake (with CASE WHEN ...), and if so, how / what am I doing wrong?

R3uK
  • Select from select? – astentx Apr 08 '21 at 21:45
  • @astentx : yes, is that the issue ? – R3uK Apr 08 '21 at 21:46
  • I've seen only one database (Exasol) that can reuse local calculations on the same level (my colleague told me that Teradata can also do this). In most DBMSes this requires rewriting the query in a step-by-step manner with a CTE (`WITH` clause) or a subquery in `FROM`, where the analytic functions do not use analytic functions inside. So at each new `select` statement you use the result of the analytic function from the previous level – astentx Apr 08 '21 at 21:54
  • @astentx : thank you for your insight, if that's so I've got a long ride to refactor what I'm doing ! (T_T) – R3uK Apr 08 '21 at 22:00
  • are you trying to get the first_value() of the last_value() if a condition is met? I think if you explain your problem (with sample data and desired output) we can find a solution – eshirvana Apr 08 '21 at 22:07
  • You should provide sample data, desired results, and a clear explanation of what you want to do. Nested window functions don't make sense to the compiler. They don't make sense to me either. – Gordon Linoff Apr 08 '21 at 22:57
  • It makes perfect sense to ME (this is a small version of what I spend my days writing in Snowflake). The context would be murky if there was a GROUP BY, at which point the second layer of window functions needs the context to know "over what 'what'". But here the same number of rows is produced. AKA this is just perfectly lovely functional programming, which SF allows. – Simeon Pilgrim Apr 08 '21 at 23:56
  • @astentx The thing you mention is called "lateral column reference" and it is already possible in Snowflake: `SELECT 1+1 AS col1, col1 *2 AS col2, col2 +1 AS col3`. Here the challenge is that you cannot nest windowed functions. – Lukasz Szozda Apr 09 '21 at 14:11

2 Answers


Deconstructing your logic shows that it's the second FIRST_VALUE that causes the problem:

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp DESC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
    IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
    /*,FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as the_problem*/
FROM data
ORDER BY Col_A,Col_B, col_timestamp
;

If we uncomment the_problem, we hit the error. Compared to PostgreSQL (my background), just getting to reuse so many prior results/steps is a gift, so here I just bust out another SELECT layer.

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT *,
    FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as not_a_problem
FROM (
    SELECT
        Col_A,
        Col_B,
        FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                        ORDER BY Col_TimeStamp ASC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
        MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp ASC
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp DESC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
        IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
        ,Col_TimeStamp
    FROM data
)
ORDER BY Col_A,Col_B, Col_TimeStamp

And then it all works. The same issue appears if you LAG, then IFF/FIRST_VALUE, and then LAG that second result (a sketch of that pattern follows).
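
For completeness, a rough, untested sketch of that second pattern with the same columns as above (the aliases prev_e and if_test_prev_e are just illustrative): the LAG and the IFF that reads its alias sit in the inner SELECT, and the window function over that derived column moves to the outer SELECT.

WITH data(Col_A, Col_B, Col_TimeStamp, Col_T, Col_E) AS (
    SELECT * FROM VALUES
        (1, 1, 1, 'testvalue', 10),
        (1, 1, 2, 'value', 11)
)
SELECT *,
    -- second layer: window function over the derived column
    FIRST_VALUE(if_test_prev_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_if_test_prev_e
FROM (
    SELECT
        Col_A,
        Col_B,
        Col_TimeStamp,
        -- first layer: LAG plus the lateral reference to its alias
        LAG(Col_E) OVER (PARTITION BY Col_A, Col_B ORDER BY Col_TimeStamp ASC) AS prev_e,
        IFF(Col_T = 'testvalue', prev_e, NULL) AS if_test_prev_e
    FROM data
)
ORDER BY Col_A, Col_B, Col_TimeStamp;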

Simeon Pilgrim
  • Thanks for your input, but I went with lateral references, as CTE seemed kind of clumsy and not so readable. – R3uK Apr 15 '21 at 05:54
  • @R3uK the CTE is only there to provide the "test data"; it's not part of the actual solution, but I like to provide solutions that you can just copy into Snowflake and run as they are, so you can play with them to see how they work. – Simeon Pilgrim Apr 15 '21 at 19:23
  • also the "lateral reference" is just the techincal name for what you code, my code and Lukasz code are all doing witch, is referring to an output column in the SELECT section while in the same SELECT section, which is not valid ANSI SQL, but insanely helpful to keep the code clean. – Simeon Pilgrim Apr 15 '21 at 19:25

"I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions..."

Snowflake supports reusing expressions at the same level (sometimes called "lateral column alias reference").

It is perfectly fine to write:

SELECT 1+1 AS col1,
       col1 *2 AS col2,
       CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' END AS col3
       ...

In standard SQL you will either have to use multiple levels of query (CTE or derived table) or use a LATERAL join (see the sketch below). Related: PostgreSQL: using a calculated column in the same query
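
As an illustration of the LATERAL route, a rough sketch in PostgreSQL-flavoured SQL (the table tab and column x here are placeholders, not from the question); each LATERAL derived table may reference aliases produced by the FROM items before it:

SELECT t.x,
       c1.col1,
       c2.col2
FROM tab AS t
CROSS JOIN LATERAL (SELECT t.x + 1 AS col1) AS c1        -- first calculated column
CROSS JOIN LATERAL (SELECT c1.col1 * 2 AS col2) AS c2;   -- reuses col1 from the previous LATERAL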


Unfortunately the same syntax will not work for analytic functions (and I am not aware of any RDBMS that supports it):

SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
      ,MAX(rn) OVER(PARTITION BY <different than prev>) AS m
FROM tab;

In the SQL:2016 standard there is an optional feature: T619, "Nested window functions".

Here is an article showing what a nested analytic function query could look like: Nested window functions in SQL.

It means that the current way to nest windowed functions is to use a derived table/CTE:

WITH cte AS (
    SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
           ,*
    FROM tab
)
SELECT  *, MAX(rn) OVER(PARTITION BY <different than prev>) AS m
FROM cte
Lukasz Szozda
  • and a CTE is just a sub-select, which is to say, at some point you have to be explicit about the data scope you are using to help the compiler know what you mean. – Simeon Pilgrim Apr 10 '21 at 00:18
  • That's the way I went; as a CTE seemed clumsy and not so readable, I managed to do what I need with lateral references, and it was still pretty readable and efficient ;) – R3uK Apr 15 '21 at 05:51