Is there a way to replace a specific value in multiple columns to null in SQL snowflake?

Question

I have a table in Snowflake where the default empty value in multiple date columns comes through as 1900-01-01. I currently import the data and manually change these values to null in R on my machine. However, since I am dealing with 30M+ rows, I want to do this in Snowflake rather than on my local machine, where it takes forever.

I know there is a replace() function with which I can manually reference each column and replace 1900-01-01 with null. However, is there a way to reference all columns with a data type equal to date and then run replace() on them?

In R we have tidyselect verbs, so in a dataframe we can dynamically reference many columns based on patterns in the column name or column type. I'm looking to see if there is something similar in SQL.

NULLIF is the way to do it on one column: `NULLIF(date_col,'1900-01-01'::date) as date_col` https://docs.snowflake.com/en/sql-reference/functions/nullif.html – Simeon Pilgrim, Aug 18 '22 at 01:30
But the simple answer is no, because SQL is set logic: by default each column is a different and meaningful thing, and there is no "for all columns" construct like the array logic of desktop computing. That is why, in one form or another, you have to name all your columns. – Simeon Pilgrim, Aug 18 '22 at 01:32

Felipe Hoffa · Answer 1 · 2022-08-19T03:35:09.830

3

Let's do some magic with Python and Snowpark - as this is a simple way of dealing with multiple columns as the question asks.

But first, let's set up a table where we want to replace one value with null:

create or replace table sample_product_data
as
select 'a' a, 'b' b, 'c' c
union all select 'x', 'this is null', 'z';

Then this is a Python stored procedure in Snowflake that takes any value in that table equal to 'this is null' and replaces it with null:

create or replace temporary procedure replace_this_is_null() 
returns VARIANT 
language python 
runtime_version=3.8 
packages=('snowflake-snowpark-python') 
handler='main' 
as 
$$

import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    tbn = 'sample_product_data'
    session.table(tbn).replace(
      'this is null', None).write.mode(
      'overwrite').save_as_table(tbn)
    return 'done'
$$;

Then you can call it with call replace_this_is_null() and it will work as expected.

Now, since the question wants to replace a date: Just import datetime, and instead of a string, compare with datetime.date(1900, 1, 1).
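As a rough sketch of that date variant, the handler logic could look like the function below. It reuses only the Snowpark calls already shown in the procedure above (`session.table`, `replace`, `write.mode`, `save_as_table`). The function name `replace_sentinel_dates` and the `table_name` parameter are hypothetical, and I have not verified how `replace` treats non-date columns when given a date value, so test it on a copy of the table first.

```python
import datetime

# The placeholder date from the question that should become NULL.
SENTINEL_DATE = datetime.date(1900, 1, 1)


def replace_sentinel_dates(session, table_name):
    """Replace SENTINEL_DATE with NULL and overwrite the table in place.

    `session` is expected to be a snowflake.snowpark.Session, as passed to
    the handler of a Python stored procedure. `table_name` is a hypothetical
    parameter added for illustration.
    """
    df = session.table(table_name)
    # Compare against a date object rather than a string.
    cleaned = df.replace(SENTINEL_DATE, None)
    cleaned.write.mode("overwrite").save_as_table(table_name)
    return "done"
```

Inside the stored procedure, `main` would simply call this with the table name, for example `replace_sentinel_dates(session, 'sample_mixed')`.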

edited Aug 19 '22 at 03:35

answered Aug 18 '22 at 03:00

Felipe Hoffa


1

well played.. I like that you solved it. I find myself triggered by question of the nature of "I have massive json blobs, turn them into tables with one SP that does all different shapes of data" or this cousin question, of "how do I fix lots of stuff, generically" I will give you some internet points/love – Simeon Pilgrim Aug 18 '22 at 03:54
1

Thanks Simeon! The fun thing with these Snowpark libraries is that it should perform at scale (dataframes get rewritten internally) – Felipe Hoffa Aug 18 '22 at 04:01
3

Nice usage of Snowpark. For anyone wondering whether something similar is possible with pure SQL: yes, using dynamic SQL (building the query from metadata), though it is tedious. A second approach is **polymorphic table functions (PTF)**, which are part of the SQL:2016 standard but unfortunately not available in Snowflake yet. They solve an entire class of cases where a dynamic result set is expected, like reading CSV files, truly dynamic PIVOT, `SELECT EXCEPT` etc. For this case it would be: `CREATE OR REPLACE TABLE ... AS SELECT FROM my_ptf(table_name, datatype, new_default)` – Lukasz Szozda Aug 18 '22 at 14:14
(cont). The `describe` component of the PTF is a very powerful concept, as it allows the result set schema to be determined **at runtime**. [Sample of PTF](https://stackoverflow.com/a/49015504/5070879) and [Polymorphic Table Functions](https://www.databricks.com/session_eu20/polymorphic-table-functions-the-best-way-to-integrate-sql-and-apache-spark) – Lukasz Szozda Aug 18 '22 at 14:18
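To make the dynamic SQL route mentioned in the comment above a bit more concrete, here is a minimal sketch that builds the statement from column metadata. It assumes you have already fetched (COLUMN_NAME, DATA_TYPE) pairs from INFORMATION_SCHEMA.COLUMNS in ordinal order. The helper names `quote_ident` and `build_nullif_ctas` are made up for illustration, and the code only generates the SQL text; running it against Snowflake is left to whatever client you use.

```python
import datetime
from typing import Iterable, Tuple


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_nullif_ctas(
    table: str,
    columns: Iterable[Tuple[str, str]],
    sentinel: datetime.date = datetime.date(1900, 1, 1),
) -> str:
    """Build a CREATE OR REPLACE TABLE ... AS SELECT statement that wraps
    every DATE column in NULLIF and passes all other columns through.

    `columns` holds (column_name, data_type) pairs exactly as stored in the
    metadata, so names are quoted to preserve their case.
    """
    select_items = []
    for name, data_type in columns:
        ident = quote_ident(name)
        if data_type.upper() == "DATE":
            select_items.append(
                f"NULLIF({ident}, '{sentinel.isoformat()}'::DATE) AS {ident}"
            )
        else:
            select_items.append(ident)
    if not select_items:
        raise ValueError("no columns supplied")
    body = ",\n    ".join(select_items)
    target = quote_ident(table)
    return f"CREATE OR REPLACE TABLE {target} AS\nSELECT\n    {body}\nFROM {target}"
```

Calling `build_nullif_ctas("SAMPLE_MIXED", [("A", "TEXT"), ("X", "DATE")])` yields a statement that selects "A" unchanged and wraps "X" in NULLIF. Every column still ends up named in the query, as Simeon points out; the naming is just automated.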

score 2 · Accepted Answer · answered Aug 18 '22 at 16:50

You can do this in Snowflake using R's tidyverse packages, which you're already familiar with.

The dbplyr package extends the dplyr package to support converting dplyr verbs to their SQL equivalent and executing them in the database. dbplyr supports Snowflake as a database for in-database execution.

To demonstrate, let's start with the data example provided by Felipe Hoffa.

library(odbc)
library(DBI)
library(dbplyr)
library(dplyr)
library(lubridate)

# Snowflake Database Connection details
server    <- "<your snowflake account here>"  # e.g. "demo43.snowflakecomputing.com"
uid       <- "<your user name>"
database  <- "<your database>"
schema    <- "<your schema>"
warehouse <- "<your virtual warehouse>"
pwd       <- "<your password>"

# Obtain ODBC Connection
# paste0() keeps the connection string on one line, with no embedded newline
con <- dbConnect(odbc::odbc(),
                 .connection_string =
                     sprintf(paste0("Driver={Snowflake};server={%s};uid={%s};",
                                    "pwd={%s};database={%s};schema={%s};warehouse={%s}"),
                             server, uid, pwd, database, schema, warehouse),
                 timeout = 10)

# Create a tbl referencing Felipe's sample database table in Snowflake
df_product <- tbl(con, "SAMPLE_PRODUCT_DATA")

# First we will get the data to the client R environment to show dplyr 
# functionality running  on a local dataframe. 
(df_product_local <- df_product %>% collect())

#> # A tibble: 2 × 3
#>   A     B            C    
#>   <chr> <chr>        <chr>
#> 1 a     b            c    
#> 2 x     this is null z

Now use dplyr verbs to convert the value 'this is null' to NA on the local dataframe:

df_product_local %>% mutate(across(everything(), ~na_if(., 'this is null')))

#> # A tibble: 2 × 3
#>   A     B     C    
#>   <chr> <chr> <chr>
#> 1 a     b     c     
#> 2 x     NA    z

Then execute the same code, replacing the local dataframe with the tbl that references the Snowflake table:

df_product %>% mutate(across(everything(), ~na_if(., 'this is null')))

#> # Source:   SQL [2 x 3]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#>   A     B     C    
#>   <chr> <chr> <chr>
#> 1 a     b     c    
#> 2 x     NA    z

If you want to run the transformation in Snowflake and return the cleaned result to your local R environment for further local processing:

df_product_cleaned <-  df_product %>% 
                       mutate(across(everything(), ~na_if(., 'this is null'))) %>%
                       collect()
head(df_product_cleaned)
#> # A tibble: 2 × 3
#>   A     B     C    
#>   <chr> <chr> <chr>
#> 1 a     b     c    
#> 2 x     NA    z

Now let's apply the same approach to the original date problem you have.

# First we create a table with mixed data; character and date columns.
mix_tblname = "SAMPLE_MIXED"
sql_ct <- sprintf("create or replace table %s as 
                   select 'a' a, 'b' b, 'c' c, 
                          '1900-01-01'::DATE x, '2022-08-17'::DATE y, '1900-01-01'::DATE z
                   union all 
                   select 'x', 'this is null', 'z',
                          '2022-08-17'::DATE, '1900-01-01'::DATE, '2022-08-15'::DATE",
                  mix_tblname )
dbExecute(con, sql_ct)  

# And reference the new table with a database tbl
df_mixed <- tbl(con, mix_tblname)
df_mixed_local <- df_mixed %>% collect()

# Check the raw data looks OK
head(df_mixed)
#> # Source:   SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#>   A     B            C     X          Y          Z         
#>   <chr> <chr>        <chr> <date>     <date>     <date>    
#> 1 a     b            c     1900-01-01 2022-08-17 1900-01-01
#> 2 x     this is null z     2022-08-17 1900-01-01 2022-08-15

The code below fails because we have columns of mixed type, and the non-date columns cannot be coerced to a DATE:

df_mixed %>% mutate(across(everything(), ~na_if(., TO_DATE('1900-01-01', 'YYYY-MM-DD'))))

We could instead implicitly convert all columns to character and evaluate as a character expression.

df_mixed %>% mutate(across(everything(), ~na_if(.,'1900-01-01'))) 

#> # Source:   SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#>   A     B            C     X          Y          Z         
#>   <chr> <chr>        <chr> <date>     <date>     <date>    
#> 1 a     b            c     NA         2022-08-17 NA        
#> 2 x     this is null z     2022-08-17 NA         2022-08-15

Although this works, it will also match columns of other types that contain the same value, which you may not want. So we need a way of identifying the DATE columns.

Here's how I can do that on a local dataframe:

df_mixed_local %>% mutate(across(where(~ is.Date(.x)), ~na_if(.,'1900-01-01')))
#> # A tibble: 2 × 6
#>   A     B            C     X          Y          Z         
#>   <chr> <chr>        <chr> <date>     <date>     <date>    
#> 1 a     b            c     NA         2022-08-17 NA        
#> 2 x     this is null z     2022-08-17 NA         2022-08-15

But it doesn't work for a database tbl. You can see that the SQL generated here is clearly missing the column-wise transformations.

df_mixed %>% mutate(across(where(~ is.Date(.x)), ~na_if(.,'1900-01-01'))) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "SAMPLE_MIXED"

I tried a few things but couldn't find a TIDY way of filtering on the Date types so instead...

We can get a vector of the date columns from Snowflake's Information Schema:

## Switch session to the Information Schema
dbExecute(con, 'USE SCHEMA INFORMATION_SCHEMA')
dateCols <- tbl(con, 'COLUMNS') %>%
            filter(TABLE_CATALOG == database,
                   TABLE_SCHEMA == schema,
                   TABLE_NAME == mix_tblname,
                   DATA_TYPE == 'DATE') %>%
            select(COLUMN_NAME) %>%
            arrange(ORDINAL_POSITION) %>% 
            pull()
## Switch session back to our data schema
dbExecute(con, sprintf('USE SCHEMA %s',schema ))

Now, using dateCols, we can selectively apply our transformation to only the DATE columns:

df_mixed %>% mutate(across(all_of(dateCols), ~na_if(.,TO_DATE('1900-01-01', 'YYYY-MM-DD')))) 

#> # Source:   SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#>   A     B            C     X          Y          Z         
#>   <chr> <chr>        <chr> <date>     <date>     <date>    
#> 1 a     b            c     NA         2022-08-17 NA        
#> 2 x     this is null z     2022-08-17 NA         2022-08-15

If anyone finds the TIDY way of applying a DATE data-type filter over the input columns I'd be interested to see it.

Hi! Thank you so much, this is very helpful. Regarding your question about a tidy way to select columns whose type is date, the following works for me: `select(where(~class(.)=="Date"))` – alejandro_hagan, Aug 19 '22 at 04:00
