
I've seen a few questions like this - Count NULL Values from multiple columns with SQL

But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?

  • Are you saying that you want to count `NULL`s in the entire table, across all columns, and then get a single value? You'll probably have to use a UDF or Stored Proc for this if you don't want to specify each column by name with a pivot or somethin' – Simon D Mar 08 '21 at 11:27
  • yes exactly that. I don't understand why it's so difficult - it's like 1 line in pandas? – Christopher Turnbull Mar 08 '21 at 13:29
  • I don't think it's doable in a single query without enumerating them (maybe through a complicated nested query on the information schema?). However...why not using python (or node) to dynamically get all column names and create/run your query? – Luca T Mar 08 '21 at 14:16
  • Because SQL and pandas are very different tools with different strengths and weaknesses; plus you probably find it easy in pandas because you already know the solution in pandas and difficult in Snowflake because you don't, yet, know how to write the SP that will achieve what you want – NickW Mar 08 '21 at 14:18
  • How do I write the SP? – Christopher Turnbull Mar 09 '21 at 10:11
  • How would you do this in Python @LucaT? – Christopher Turnbull Mar 09 '21 at 10:12

2 Answers


But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?

yes exactly that. I don't understand why it's so difficult - it's like 1 line in pandas?

The key point here is that if something is not provided "batteries included", you need to write your own version. It is not as hard as it may look.

Let's say the input table is as follows:

CREATE OR REPLACE TABLE t AS SELECT $1 AS col1, $2 AS col2, $3 AS col3, $4 AS col4 
FROM VALUES (1,2,3,10),(NULL,2,3,10),(NULL,NULL,4,10),(NULL,NULL,NULL,10);

SELECT * FROM t;
/*
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1    | 2    | 3    |   10 |
| NULL | 2    | 3    |   10 |
| NULL | NULL | 4    |   10 |
| NULL | NULL | NULL |   10 |
+------+------+------+------+
*/

You probably know how to write the query that gives the desired output, but since it was not provided in the question, I will use my own version:

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      ,total_rows - COUNT(col1) AS col1
      ,total_rows - COUNT(col2) AS col2
      ,total_rows - COUNT(col3) AS col3
      ,total_rows - COUNT(col4) AS col4
  FROM t
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (col1,col2,col3, col4))
ORDER BY COLUMN_NAME;
/*
+-------------+--------------------+-------------------+
| COLUMN_NAME | NULLS_COLUMN_COUNT | NULLS_TOTAL_COUNT |
+-------------+--------------------+-------------------+
| COL1        |                  3 |                 6 |
| COL2        |                  2 |                 6 |
| COL3        |                  1 |                 6 |
| COL4        |                  0 |                 6 |
+-------------+--------------------+-------------------+
*/

Here we can see that the query is "static" in nature, with only a few moving parts (column_count_list / table_name / column_list):

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      <column_count_list>
  FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;

Now, using the INFORMATION_SCHEMA metadata and session variables, we can generate that query:

-- input
SET sch_name = 'my_schema';
SET tab_name = 't';

SELECT
     -- <column_list>: comma-separated column names for the UNPIVOT clause
     LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP(ORDER BY c.COLUMN_NAME) AS column_list
     -- <table_name>: the fully qualified table name
    ,ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) AS full_table_name
     -- <column_count_list>: one ",total_rows - COUNT(col) AS col" line per column
    ,LISTAGG(REPLACE(SPACE(6) || ',total_rows - COUNT(<col_name>) AS <col_name>' 
                              || CHAR(13)
             , '<col_name>', c.COLUMN_NAME), '') 
     WITHIN GROUP(ORDER BY COLUMN_NAME) AS column_count_list

    ,REPLACE(REPLACE(REPLACE(
'WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
<column_count_list>
  FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;'
    ,'<column_count_list>',     column_count_list)
    ,'<table_name>',            full_table_name)
    ,'<column_list>',           column_list) AS query_to_run

FROM INFORMATION_SCHEMA.COLUMNS c
WHERE TABLE_SCHEMA = UPPER($sch_name)
  AND TABLE_NAME = UPPER($tab_name);

Running the code will generate the query to be run (returned in the QUERY_TO_RUN column):

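For the sample table t, the generated query looks roughly like the following (reconstructed from the template above rather than copied from the result, so identifier case and whitespace may differ slightly; INFORMATION_SCHEMA returns names in upper case):

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      ,total_rows - COUNT(COL1) AS COL1
      ,total_rows - COUNT(COL2) AS COL2
      ,total_rows - COUNT(COL3) AS COL3
      ,total_rows - COUNT(COL4) AS COL4
  FROM MY_SCHEMA.T
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (COL1, COL2, COL3, COL4))
ORDER BY COLUMN_NAME;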

Copying the query from the QUERY_TO_RUN column and running it will give the desired output. This template could be further refined and wrapped in a stored procedure if needed (but I will leave that as an exercise).
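If you do want the stored-procedure route, here is a minimal sketch using Snowflake Scripting. The procedure name count_nulls, its argument names, and the simplified string concatenation are my own choices and not part of the original answer; treat it as a starting point rather than a finished implementation.

-- Minimal sketch: wrap the dynamic query in a Snowflake Scripting procedure.
-- count_nulls and its arguments are illustrative names, not from the answer above.
CREATE OR REPLACE PROCEDURE count_nulls(sch_name VARCHAR, tab_name VARCHAR)
RETURNS TABLE()
LANGUAGE SQL
AS
$$
DECLARE
  query_to_run VARCHAR;
  res RESULTSET;
BEGIN
  -- Build the dynamic query from INFORMATION_SCHEMA (same idea as the template above)
  SELECT 'WITH cte AS (SELECT COUNT(*) AS total_rows, '
         || LISTAGG('total_rows - COUNT(' || c.COLUMN_NAME || ') AS ' || c.COLUMN_NAME, ', ')
            WITHIN GROUP (ORDER BY c.COLUMN_NAME)
         || ' FROM ' || ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) || ') '
         || 'SELECT COLUMN_NAME, NULLS_COLUMN_COUNT, '
         || 'SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT '
         || 'FROM cte UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN ('
         || LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY c.COLUMN_NAME)
         || ')) ORDER BY COLUMN_NAME'
    INTO :query_to_run
  FROM INFORMATION_SCHEMA.COLUMNS c
  WHERE c.TABLE_SCHEMA = UPPER(:sch_name)
    AND c.TABLE_NAME   = UPPER(:tab_name);

  -- Run the generated query and return its result set to the caller
  res := (EXECUTE IMMEDIATE :query_to_run);
  RETURN TABLE(res);
END;
$$;

CALL count_nulls('my_schema', 't');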

Lukasz Szozda

@chris you should note that the metadata in Snowflake is similar to SQL Server's, so anything you want to know at the metadata level has likely already been solved by SQL Server practitioners. See this link - Count number of NULL values in each column in SQL. This is different in Oracle, where the metadata tables give the number of nulls in each column as well as the density.
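For reference, and not part of either answer above, this is roughly what the Oracle side looks like: the per-column null counts and density live in the data dictionary as optimizer statistics, so they are only as fresh as the last statistics gather (the table name T here is just the sample table from the accepted answer).

-- Oracle: per-column null counts and density are stored as optimizer statistics
-- and can be read straight from the data dictionary (no per-column SQL needed).
SELECT column_name, num_nulls, density
FROM   user_tab_columns
WHERE  table_name = 'T'
ORDER  BY column_name;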