
I want to compare multiple date columns with each other and create a new column "latest_date", which is basically the greatest of them all.

For this we could use the GREATEST() function (DOC: https://docs.snowflake.com/en/sql-reference/functions/greatest). The problem is that this function's output is null if any of the compared columns is null.

I have read multiple solutions for this here:

Most of them propose to use the "COALESCE" function as follows:

GREATEST(
    COALESCE(a, b, c),
    COALESCE(b, c, a),
    COALESCE(c, a, b)
)

This is OK if we have just 2 or 3 columns, but it does not scale well when handling many columns (some of them containing nulls).
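For what it's worth, when the column list is only known at runtime, the rotated-COALESCE pattern can at least be generated programmatically instead of written by hand. A minimal sketch in plain Python (the helper name is mine) that just builds the SQL text:

```python
def greatest_ignore_nulls_sql(cols):
    """Build a GREATEST(...) expression whose arguments are the input
    columns rotated through COALESCE, so that a single NULL column does
    not null out the whole result (only all-NULL rows yield NULL)."""
    n = len(cols)
    rotations = [
        "COALESCE(" + ", ".join(cols[i:] + cols[:i]) + ")"
        for i in range(n)
    ]
    return "GREATEST(" + ", ".join(rotations) + ")"

print(greatest_ignore_nulls_sql(["a", "b", "c"]))
# GREATEST(COALESCE(a, b, c), COALESCE(b, c, a), COALESCE(c, a, b))
```

Note this still produces N² column references for N columns, which is exactly the scaling problem described above.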

In my use case, I'm using Snowpark Dataframes (along with Snowflake SQL) and I handle most of these date columns in my tables/dataframes in a "generic way" so I don't know in advance how many columns I will be comparing.

Could someone think of a workaround for this?

  • https://stackoverflow.com/a/74529349/43992 – Simeon Pilgrim Aug 25 '23 at 23:46
  • [How to use `GREATEST()` in Snowflake with null values?](https://stackoverflow.com/questions/74527418/how-to-use-greatest-in-snowflake-with-null-values) - my personal favorite is ARRAY_MAX/ARRAY_MIN - https://stackoverflow.com/a/76689938/5070879 – Lukasz Szozda Aug 26 '23 at 08:24
  • This is somewhat useful. The problem here is that the casting to array/variant is not as straightforward in Snowpark as it is in plain SQL (Snowflake). I guess the best solution would be to join the values of the multiple date columns into one array/variant column and then use ARRAY_MAX. – Nicolas Sanchez Aug 28 '23 at 16:09
  • ARRAY_MAX is not yet available as a function in Snowpark (< 1.6.1) in Snowflake. – Nicolas Sanchez Aug 28 '23 at 18:31
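Building on the ARRAY_MAX suggestion in the comments: ARRAY_CONSTRUCT_COMPACT drops NULL elements, so ARRAY_MAX then only sees the non-null dates. Even where Snowpark lacks a dedicated wrapper, the expression can be assembled generically; a sketch (helper name is mine) that just builds the SQL text:

```python
def array_max_sql(cols):
    # ARRAY_CONSTRUCT_COMPACT drops NULL elements, so ARRAY_MAX
    # compares only the non-null dates.
    return "ARRAY_MAX(ARRAY_CONSTRUCT_COMPACT(" + ", ".join(cols) + "))"

print(array_max_sql(["a", "b", "c"]))
# ARRAY_MAX(ARRAY_CONSTRUCT_COMPACT(a, b, c))
```

Two caveats: ARRAY_MAX returns a VARIANT, so the result may need a cast (e.g. `::DATE`); and in Snowpark versions without a native `array_max` wrapper, a SQL builtin can generally be invoked by name (e.g. via `call_function`/`call_builtin` in `snowflake.snowpark.functions`) rather than via raw SQL.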

2 Answers


You can do this with Ponder / Modin using normal pandas syntax. After connecting to Snowflake, we create a temp table on Snowflake with two date columns (you would not have to do this if you already have the data in Snowflake):

    >> df = modin.pandas.DataFrame(data={'A': ['1993-06-09', '1993-11-24'], 'B': ['1993-11-24', np.NaN]})
    >> date_df = df.astype('datetime64[ns]')
    >> date_df 
               A          B
    0 1993-06-09 1993-11-24
    1 1993-11-24        NaT

Then max() should work on top of Snowflake.

    >> date_df.max() 
    A   1993-11-24
    B   1993-11-24
    Name: 0, dtype: datetime64[ns]
    >> date_df.min(axis=1)
    0   1993-06-09
    1          NaT
    dtype: datetime64[ns]

Unfortunately, because this is executing on Snowflake, Ponder still is not handling NaN values exactly the way pandas would (bug filed), so a fillna is needed (and it should scale, because Ponder is essentially doing the same coalesce in the underlying generated SQL).

    >> date_df.fillna('1970-01-01').max(axis=1)
    0   1993-11-24
    1   1993-11-24

Another solution might be to write an apply with a lambda and execute that on Snowflake through Ponder. The lambda gets translated into a UDF, which is executed on the warehouse side (without copying the data). I've filed a bug for the NaT handling with max - in general, Ponder should match the semantics of pandas.
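For comparison, plain in-memory pandas already skips NaT in a row-wise max without the fillna workaround - this is the semantics Ponder should eventually match. A quick sketch:

```python
import numpy as np
import pandas as pd

# Same data as above, built locally instead of on Snowflake.
df = pd.DataFrame({"A": ["1993-06-09", "1993-11-24"],
                   "B": ["1993-11-24", np.nan]}).astype("datetime64[ns]")

# skipna=True is the default: NaT is ignored unless a whole row is NaT.
latest = df.max(axis=1)
print(latest)
# 0   1993-11-24
# 1   1993-11-24
# dtype: datetime64[ns]
```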

– John Kew
  • Ok, this makes sense. We would need a Snowpark.DataFrame to pd.DataFrame conversion though, which is usually not advisable in terms of performance on the server side. I guess this is a valid solution though. – Nicolas Sanchez Aug 29 '23 at 21:42

you can use

    GREATEST(
        IFNULL(a, to_timestamp('01/01/0000 00:00:00')),
        IFNULL(b, to_timestamp('01/01/0000 00:00:00')),
        IFNULL(c, to_timestamp('01/01/0000 00:00:00'))
    )

So you don't need to try all combinations.
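This sentinel pattern also extends to an arbitrary column list; a sketch (helper name is mine, and the sentinel should predate any real value in your data) that just builds the SQL text:

```python
def greatest_with_sentinel_sql(cols, sentinel="1970-01-01"):
    # Replace NULL with a sentinel older than any real date, so that
    # GREATEST never receives a NULL argument.
    args = ["IFNULL({}, '{}'::DATE)".format(c, sentinel) for c in cols]
    return "GREATEST(" + ", ".join(args) + ")"

print(greatest_with_sentinel_sql(["a", "b", "c"]))
# GREATEST(IFNULL(a, '1970-01-01'::DATE), IFNULL(b, '1970-01-01'::DATE), IFNULL(c, '1970-01-01'::DATE))
```

One caveat: if every column in a row is NULL, the result is the sentinel date rather than NULL, which may or may not be what you want.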

– nbk