2

I'm trying to use GREATEST() in Snowflake, but whenever I have null values, I get null instead of the desired result:

select greatest(1,2,null);

-- null

This behavior has confused many, and it begins with the behavior of GREATEST() in Oracle, which Snowflake matches:

It has also being discussed in the Snowflake forums:

Sample data:

create or replace table some_nulls
as (
    select $1 a, $2 b, $3 c
    from values(1.1, 2.3, null::float), (null, 2, 3.5), (1, null, 3), (null, null, null)
);

select greatest(a, b)
from some_nulls;

enter image description here

Asking here to get the best available solution.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325

6 Answers6

7

It is possible to handle nullable columns with ARRAYs:

SELECT a,b,c, GREATEST([a],[b],[c])[0]::INT
FROM some_nulls;

For sample data:

CREATE OR REPLACE TABLE some_nulls(a INT, b INT, c INT)
AS
SELECT 1, 2, NULL UNION
SELECT NULL, 2, 3 UNION
SELECT 1, NULL, 3 UNION
SELECT NULL, NULL, NULL;

Output:

enter image description here


How it works:

NULL becomes [undefined] which is the lowest element, therefore this approach could be used for GREATEST but NOT for LEAST function.

EDIT: As per Denis' comment: -(GREATEST( [-a] ,[-b], [-c] )[0]::INT) finds LEAST value


EDIT: Much cleaner approach is possible with recently introduced ARRAY_MIN/MAX functions: https://stackoverflow.com/a/76689938/5070879

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • 2
    This is fascinating, I love it. Just to show it doesn't need a cast to int, float works too: `SELECT a,b,c, GREATEST([a],[b],[c])[0]::float FROM some_nulls;` – Felipe Hoffa Nov 22 '22 at 18:42
  • 1
    (Side note: I see you have been fighting these monsters for many years https://stackoverflow.com/a/46267039/132438) – Felipe Hoffa Nov 22 '22 at 18:43
  • 1
    @FelipeHoffa Nice catch! Completely forgot about this entry :) – Lukasz Szozda Nov 22 '22 at 18:45
  • 1
    Since this answer is marked as correct, I will add a trick I found in other answers below: `-(GREATEST( [-a] ,[-b], [-c] )[0]::INT` finds LEAST value - notice double negation. – Denis Kokorin Feb 15 '23 at 12:55
  • @DenisKokorin Nice idea for LEAST. Thank you. – Lukasz Szozda Feb 15 '23 at 15:12
  • The `LEAST` trick seems to work with INTEGER / FLOAT types but I wasn't able to apply it on TIMESTAMP type (the minus sign is not valid). any ideas how to pull the `LEAST` with TIMESTAMP columns? – shayms8 Jun 08 '23 at 09:07
  • @shayms8 The usage of arrays was intended as a "workaround"(hackish). It will not work for non-numeric data types. – Lukasz Szozda Jun 08 '23 at 09:45
4

GREATEST and LEAST with IGNORE NULLS could be emulated by ARRAY_MIN and ARRAY_MAX functions:

SELECT GREATEST(1,2,NULL), LEAST(1,2,NULL), 
       ARRAY_MAX([1,2,NULL]), ARRAY_MIN([1,2,NULL]);
-- NULL NULL 2 1

Output:

enter image description here

Returned output has VARIANT data type and it should be casted explicitly: ARRAY_MAX([...])::NUMBER.

Function Behavior equivalent
GREATEST(...) GREATEST(...) RESPECT NULLS
LEAST(...) LEAST(...) RESPECT NULLS
ARRAY_MAX(...) GREATEST(...) IGNORE NULLS
ARRAY_MIN(...) LEAST(...) IGNORE NULLS

This approach works also for columns:

CREATE OR REPLACE TABLE tab(col1, col2, col3) AS
SELECT 1 ,2, NULL
UNION ALL SELECT 10, 20, 30;

SELECT *,
       GREATEST(col1, col2, col3),
       ARRAY_MAX([col1, col2, col3])::NUMBER
FROM tab;

enter image description here


These functions are present but not enabled by default:

SHOW FUNCTIONS LIKE 'ARRAY_M%';

ARRAY_MAX(ARRAY) RETURN VARIANT the maximum defined element in the array, or NULL
ARRAY_MIN(ARRAY) RETURN VARIANT the minimum defined element in the array, or NULL

2023_05 Bundle

SELECT SYSTEM$ENABLE_BEHAVIOR_CHANGE_BUNDLE('2023_05');

Status changed in 7.29 release (August 22-23) to Enabled by Default; account admins can disable for opt-out.

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
2

One solution could be to create a UDF that picks the greatest() or the first non null:

create or replace function greatest2(x1 float, x2 float)
returns float
as $$
    coalesce(greatest(x1, x2), x1, x2)
$$;

select greatest2(a, b)
from some_nulls;

enter image description here

However things get more complex if you need to compare multiple values. For example, if you want to compare 3 columns, then you have to create a custom UDF with 3 arguments and check each for null:

create or replace function greatest3(x1 float, x2 float, x3 float)
returns float
as $$
    select iff(x='-inf', null, x)
    from (
        select greatest(nvl(x1, '-inf'), nvl(x2, '-inf'), nvl(x3, '-inf')) x
    )
$$;

select greatest3(a, b, c)
from some_nulls;

enter image description here

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
0

Documenting here an approach that doesn't work (to save others time, or an opportunity to fix): Arrays in a SQL UDF.

create or replace function greatest_a(arr array)
returns float
immutable
as $$
    select max(value::float)
    from table(flatten(arr))
$$;

select greatest_a([null,2,3.3])
from some_nulls;

This works until you try to create the array with values from the table.

select greatest_a([a, b, c])
from some_nulls;

-- Unsupported subquery type cannot be evaluated

A similar approach with a JS UDF would work, but it will be slower than a pure SQL UDF.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    Ah, I see you point, you mentioned on my answer, I dislike functions as they tend to be trying to make SQL neat and tidy very explicit about how you do the work. And the most often end up with a correlated sub-query, which are performance yuck – Simeon Pilgrim Nov 22 '22 at 19:26
0

Welcome user: Felipe

The doc's seem to describe the results exactly:

Returns the largest value from a list of expressions. If any of the argument values is NULL, the result is NULL. GREATEST supports all data types, including VARIANT.

If you ORDER BY data, NULL is last, GREATEST is returning the LAST value..

Snowflake is rather consistent

select 
    a,b,c
    ,greatest(a, b) as g_a_b
    ,greatest(a, c) as g_a_c
    ,greatest(b, c) as g_b_c
from values
    (1.1, 2.3, null::float), 
    (null, 2, 3.5), 
    (1, null, 3), 
    (null, null, null)
    t(a,b,c)

gives:

A B C G_A_B G_A_C G_B_C
1.1 2.3 null 2.3 null null
null 2 3.5 null null 3.5
1 null 3 null 3 null
null null null null null null

so your min value -inf solution is interesting/gross, I mean it's correct.

But what GREATEST is doing is handling a fixed number of SQL declared columns, so the NVL soltuion works: But to make a generic solution, building the fix arrays, and then flattening, and the maxing as max handles NULL in the implicit way the problem implies which "them not being selected"

select a,b,c, max(f.value)
from (
    select 
        a,b,c
        ,array_construct_compact(a, b, c) as aa
    from values
        (1.1, 2.3, null), 
        (null, 2, 3.5), 
        (1, null, 3), 
        (null, null, null)
        t(a,b,c)
), table(flatten(input=>aa)) as f
group by 1,2,3,f.seq

gives:

A B C MAX(F.VALUE)
1 null 3 3
null 2 3.5 3.5
1.1 2.3 null 2.3
Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
  • 1
    The expectation that GREATEST skips NULL values is completely understandable: 1) `If you ORDER BY data, NULL is last, GREATEST is returning the LAST value..` - Using this argument then [LEAST](https://docs.snowflake.com/en/sql-reference/functions/least.html) function should return the lowest value(Nulls are at the end) and it returns NULL – Lukasz Szozda Nov 22 '22 at 17:35
  • 1
    2) MAX/MIN aggregate functions skip NULLs (by analogy GREATEST/LEAST are the scalar equivalents) 3) Implementations: [PostgreSQL](https://www.postgresql.org/docs/15/functions-conditional.html#FUNCTIONS-GREATEST-LEAST), [SQL Server](https://learn.microsoft.com/en-us/sql/t-sql/functions/logical-functions-greatest-transact-sql?view=azure-sqldw-latest#remarks) and [Databricks](https://docs.databricks.com/sql/language-manual/functions/greatest.html) ignore NULLS. – Lukasz Szozda Nov 22 '22 at 17:36
  • Thanks Simeon! The problem I have with flattening arrays is that they won't run inside a SQL UDF https://stackoverflow.com/a/74527633/132438 – Felipe Hoffa Nov 22 '22 at 18:45
  • @LukaszSzozda your "LEAST does not do the opposite" point is very valid (aka I will just believe you) where-as MAX/MIN verse GREATEST/LEAST are different "things" just like LAST_VALUE is not an aggregation function. I guess it is "rather gross" to not have a uniform handling, albeit the NULL case does feel like a floating NaN and the aggregation function have a free lunch handling, and the non-aggregation function don't.. At some level just understanding what is happening verse extension of weak implicit knowledge is needed by all. – Simeon Pilgrim Nov 22 '22 at 19:24
  • 1
    @SimeonPilgrim The GREATEST/LEAST are not defined in SQL Standard and both implementations are "correct". The behavior of GREATEST/LEAST but also CONCAT_WS simply follows [RETURNS NULL ON NULL INPUT](https://docs.snowflake.com/en/sql-reference/sql/create-function.html#optional-parameters), I wish we had overloaded versions of these functions that skip nulls. – Lukasz Szozda Nov 22 '22 at 19:36
  • 1
    @LukaszSzozda yes, agreed, variants of function that allow the edge case explicitly would be best of both worlds, as both have use cases, and finding the function doesn't behave as expected "by surprise" is always a shocker. – Simeon Pilgrim Nov 22 '22 at 19:38
  • 1
    @SimeonPilgrim Now `GREATEST` and `LEAST` are part of SQL:2023 standard and the default behavior is `RESPECT NULLS`. It would be nice to have option `RESPECT/IGNORE NULLS` but I found an alternative approach to handle it: https://stackoverflow.com/a/76689938/5070879 – Lukasz Szozda Jul 14 '23 at 17:25
0

Here's a yucky approach, but at least:

  • It works for both GREATEST & LEAST.
  • It doesn't require any synthetic values, like -inf (i.e. it can only return values from the fields themselves, and will never convert any valid values to NULL).
  • It scales (uglily) to any number of parameters (it just gets harder/uglier to write with more parameters).
  • The result will only be NULL if all the fields are NULL.
WITH test AS (
  SELECT
    column1 AS a,
    column2 AS b,
    column3 AS c,
    column4 AS d
  FROM VALUES
    ('aaa', 'bbb', 'ccc', NULL),
    ('aaa', 'bbb', NULL, 'ccc'),
    ('aaa', NULL, 'bbb', 'ccc'),
    (NULL, 'aaa', 'bbb', 'ccc'),
    (NULL, NULL, NULL, NULL)
)
SELECT
  GREATEST(COALESCE(a, b, c, d), COALESCE(b, c, d, a), COALESCE(c, d, a, b), COALESCE(d, a, b, c)) AS _greatest,
  LEAST(COALESCE(a, b, c, d), COALESCE(b, c, d, a), COALESCE(c, d, a, b), COALESCE(d, a, b, c)) AS _least
FROM test;

We have to COALESCE all the values N times, each time starting with a different column (in revolving order, like a circular list), so that:

  • If all the columns do have a value, all the values are compared (i.e. the first column in the COALESCE) to find the legitimate GREATEST/LEAST.
  • If any of the columns contain NULL, it never surfaces (unless all of the values are NULL).

Result:

result

We could put the code in SQL UDFs (which support overloading for various number of parameters):

  • But then we'd also have to deal with specifying types, or handling automatic type conversion (unless everything is passed as VARIANT).
  • And it can hurt performance.

I wish we didn't have to do this, even if that meant we'd have to call GREATEST2(...) or LEAST2(...). Hopefully, Snowflake will make this a reality someday!

Marco Roy
  • 4,004
  • 7
  • 34
  • 50