
I've seen a few questions like this - Count NULL Values from multiple columns with SQL

But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?

  • Are you saying that you want to count `NULL`s in the entire table, across all columns, and then get a single value? You'll probably have to use a UDF or Stored Proc for this if you don't want to specify each column by name with a pivot or somethin' – Simon D Mar 08 '21 at 11:27
  • yes exactly that. I don't understand why it's so difficult - it's like 1 line in pandas? – Christopher Turnbull Mar 08 '21 at 13:29
  • I don't think it's doable in a single query without enumerating them (maybe through a complicated nested query on the information schema?). However...why not using python (or node) to dynamically get all column names and create/run your query? – Luca T Mar 08 '21 at 14:16
  • Because SQL and pandas are very different tools with different strengths and weaknesses; plus you probably find it easy in pandas because you already know the solution in pandas and difficult in Snowflake because you don't, yet, know how to write the SP that will achieve what you want – NickW Mar 08 '21 at 14:18
  • How do I write the SP? – Christopher Turnbull Mar 09 '21 at 10:11
  • How would you do this in Python @LucaT? – Christopher Turnbull Mar 09 '21 at 10:12

2 Answers


But is there really not a way to count nulls in a table with say, over 30 columns? Like I don't want to specify them all by name?

yes exactly that. I don't understand why it's so difficult - it's like 1 line in pandas?

The key point here is that if something is not provided "batteries included", you need to write your own version. It is not as hard as it may look.

Let's say the input table is as follows:

CREATE OR REPLACE TABLE t AS SELECT $1 AS col1, $2 AS col2, $3 AS col3, $4 AS col4 
FROM VALUES (1,2,3,10),(NULL,2,3,10),(NULL,NULL,4,10),(NULL,NULL,NULL,10);

SELECT * FROM t;
/*
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1    | 2    | 3    |   10 |
| NULL | 2    | 3    |   10 |
| NULL | NULL | 4    |   10 |
| NULL | NULL | NULL |   10 |
+------+------+------+------+
*/

You probably know how to write the query that gives the desired output, but since it was not provided in the question, I will use my own version:

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      ,total_rows - COUNT(col1) AS col1
      ,total_rows - COUNT(col2) AS col2
      ,total_rows - COUNT(col3) AS col3
      ,total_rows - COUNT(col4) AS col4
  FROM t
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (col1,col2,col3, col4))
ORDER BY COLUMN_NAME;
/*
+-------------+--------------------+-------------------+
| COLUMN_NAME | NULLS_COLUMN_COUNT | NULLS_TOTAL_COUNT |
+-------------+--------------------+-------------------+
| COL1        |                  3 |                 6 |
| COL2        |                  2 |                 6 |
| COL3        |                  1 |                 6 |
| COL4        |                  0 |                 6 |
+-------------+--------------------+-------------------+
*/

Here we can see that the query is "static" in nature, with only a few moving parts (column_count_list / table_name / column_list):

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      <column_count_list>
  FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;

Now, using the INFORMATION_SCHEMA metadata and session variables, we can generate that query:

-- input
SET sch_name = 'my_schema';
SET tab_name = 't';

SELECT
     -- <column_list>: comma-separated column names for the UNPIVOT clause
     LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP(ORDER BY c.COLUMN_NAME) AS column_list
     -- <table_name>: the fully qualified table name
    ,ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) AS full_table_name
     -- <column_count_list>: one ",total_rows - COUNT(col) AS col" line per column
    ,LISTAGG(REPLACE(SPACE(6) || ',total_rows - COUNT(<col_name>) AS <col_name>' 
                              || CHAR(13)
             , '<col_name>', c.COLUMN_NAME), '') 
     WITHIN GROUP(ORDER BY COLUMN_NAME) AS column_count_list

    ,REPLACE(REPLACE(REPLACE(
'WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
<column_count_list>
  FROM <table_name>
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (<column_list>))
ORDER BY COLUMN_NAME;'
    ,'<column_count_list>',     column_count_list)
    ,'<table_name>',            full_table_name)
    ,'<column_list>',           column_list) AS query_to_run

FROM INFORMATION_SCHEMA.COLUMNS c
WHERE TABLE_SCHEMA = UPPER($sch_name)
  AND TABLE_NAME = UPPER($tab_name);

Running the code will generate the query to be run (returned in the QUERY_TO_RUN column):

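For the sample table t, the generated query looks roughly like the following (reconstructed from the template above rather than copied from the result, so identifier case and whitespace may differ slightly; INFORMATION_SCHEMA returns names in upper case):

WITH cte AS (
  SELECT
      COUNT(*) AS total_rows
      ,total_rows - COUNT(COL1) AS COL1
      ,total_rows - COUNT(COL2) AS COL2
      ,total_rows - COUNT(COL3) AS COL3
      ,total_rows - COUNT(COL4) AS COL4
  FROM MY_SCHEMA.T
)
SELECT COLUMN_NAME, NULLS_COLUMN_COUNT,SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT
FROM cte
UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN (COL1, COL2, COL3, COL4))
ORDER BY COLUMN_NAME;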

Copying the query from the QUERY_TO_RUN column and running it will give the desired output. This template could be further refined and wrapped in a stored procedure if needed (but I will leave that as an exercise).
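If you do want the stored-procedure route, here is a minimal sketch using Snowflake Scripting. The procedure name count_nulls, its argument names, and the simplified string concatenation are my own choices and not part of the original answer; treat it as a starting point rather than a finished implementation.

-- Minimal sketch: wrap the dynamic query in a Snowflake Scripting procedure.
-- count_nulls and its arguments are illustrative names, not from the answer above.
CREATE OR REPLACE PROCEDURE count_nulls(sch_name VARCHAR, tab_name VARCHAR)
RETURNS TABLE()
LANGUAGE SQL
AS
$$
DECLARE
  query_to_run VARCHAR;
  res RESULTSET;
BEGIN
  -- Build the dynamic query from INFORMATION_SCHEMA (same idea as the template above)
  SELECT 'WITH cte AS (SELECT COUNT(*) AS total_rows, '
         || LISTAGG('total_rows - COUNT(' || c.COLUMN_NAME || ') AS ' || c.COLUMN_NAME, ', ')
            WITHIN GROUP (ORDER BY c.COLUMN_NAME)
         || ' FROM ' || ANY_VALUE(c.TABLE_SCHEMA || '.' || c.TABLE_NAME) || ') '
         || 'SELECT COLUMN_NAME, NULLS_COLUMN_COUNT, '
         || 'SUM(NULLS_COLUMN_COUNT) OVER() AS NULLS_TOTAL_COUNT '
         || 'FROM cte UNPIVOT (NULLS_COLUMN_COUNT FOR COLUMN_NAME IN ('
         || LISTAGG(c.COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY c.COLUMN_NAME)
         || ')) ORDER BY COLUMN_NAME'
    INTO :query_to_run
  FROM INFORMATION_SCHEMA.COLUMNS c
  WHERE c.TABLE_SCHEMA = UPPER(:sch_name)
    AND c.TABLE_NAME   = UPPER(:tab_name);

  -- Run the generated query and return its result set to the caller
  res := (EXECUTE IMMEDIATE :query_to_run);
  RETURN TABLE(res);
END;
$$;

CALL count_nulls('my_schema', 't');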

Lukasz Szozda

@chris you should note that the metadata in Snowflake is similar to SQL Server's, so anything you want to know at the metadata level has likely already been solved by SQL Server practitioners. See this link - Count number of NULL values in each column in SQL. This is different in Oracle, where the metadata tables give the number of nulls in each column as well as the density.
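For reference, and not part of either answer above, this is roughly what the Oracle side looks like: the per-column null counts and density live in the data dictionary as optimizer statistics, so they are only as fresh as the last statistics gather (the table name T here is just the sample table from the accepted answer).

-- Oracle: per-column null counts and density are stored as optimizer statistics
-- and can be read straight from the data dictionary (no per-column SQL needed).
SELECT column_name, num_nulls, density
FROM   user_tab_columns
WHERE  table_name = 'T'
ORDER  BY column_name;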