3

I have a table in my Postgres database that I'm trying to determine fill rates for (that is, I'm trying to understand how often data is/isn't missing). I need to make a function that, for each column (in a list of a couple dozen columns I've selected), counts the number and percentage of columns with non-null values.

The problem is, I don't really know how to iterate through a list of columns in a programmatic way, because I don't know how to reference a column from a string of its name. I've read about how you can use the EXECUTE command to run dynamically-written SQL, but I haven't been able to get it to work. Here's my current function:

CREATE OR REPLACE FUNCTION get_fill_rates() RETURNS TABLE (field_name text, fill_count integer, fill_percentage float) AS $$
DECLARE
    fields text[] := array['column_a', 'column_b', 'column_c'];
    total_rows integer;
BEGIN
    SELECT reltuples INTO total_rows FROM pg_class WHERE relname = 'my_table';

    FOR i IN array_lower(fields, 1) .. array_upper(fields, 1)
    LOOP
        field_name := fields[i];
        EXECUTE 'SELECT COUNT(*) FROM my_table WHERE $1 IS NOT NULL' INTO fill_count USING field_name;
        fill_percentage := fill_count::float / total_rows::float;
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT * FROM get_fill_rates() ORDER BY fill_count DESC;

This function, as written, returns every field as having a 100% fill rate, which I know to be false. How can I make this function work?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Hayden Schiff
  • 3,280
  • 19
  • 41

3 Answers3

2

I know you already solved it. But let me suggest you to avoid concatenating identifiers on dynamic queries, you can use format with a identifier wildcard instead:

CREATE OR REPLACE FUNCTION get_fill_rates() RETURNS TABLE (field_name text, fill_count integer, fill_percentage float) AS $$
DECLARE
    fields text[] := array['column_a', 'column_b', 'column_c'];
    table_name name := 'my_table';
    total_rows integer;
BEGIN
    SELECT reltuples INTO total_rows FROM pg_class WHERE relname = table_name;

    FOREACH field_name IN ARRAY fields
    LOOP
        EXECUTE format('SELECT COUNT(*) FROM %I WHERE %I IS NOT NULL', table_name, field_name) INTO fill_count;
        fill_percentage := fill_count::float / total_rows::float;
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Doing this way will help you preventing SQL-injection attacks and will reduce query parse overhead a bit. More info here.

Michel Milezzi
  • 10,087
  • 3
  • 21
  • 36
1

I figured out the solution after I wrote my question but before I submitted it -- since I've already done the work of writing the question, I'll just go ahead and share the answer. The problem was in my EXECUTE statement, specifically with that USING field_name bit. I think it was getting treated as a string literal when I did it that way, which meant the query was evaluating if "a string literal" IS NOT NULL which of course, is always true.

Instead of parameterizing the column name, I need to inject it directly into the query string. So, I changed my EXECUTE line to the following:

EXECUTE 'SELECT COUNT(*) FROM my_table WHERE ' || field_name || ' IS NOT NULL' INTO fill_count;
Hayden Schiff
  • 3,280
  • 19
  • 41
1

Some problems in the code aside (see below), this can be substantially faster and simpler with a single scan over the table in a plain query:

SELECT v.*
FROM  (
   SELECT count(column_a) AS ct_column_a
        , count(column_b) AS ct_column_b
        , count(column_c) AS ct_column_c
        , count(*)::numeric AS ct
   FROM   my_table
   ) sub
     , LATERAL (
      VALUES
         (text 'column_a', ct_column_a, round(ct_column_a / ct, 3))
       , (text 'column_b', ct_column_b, round(ct_column_b / ct, 3))
       , (text 'column_c', ct_column_c, round(ct_column_c / ct, 3))
   ) v(field_name, fill_count, fill_percentage);

The crucial "trick" here is that count() only counts non-null values to begin with, no tricks required.

I rounded the percentage to 3 decimal digits, which is optional. For this I cast to numeric.

Use a VALUES expression to unpivot the results and get one row per field.

For repeated use or if you have a long list of columns to process, you can generate and execute the query dynamically. But, again, don't run a separate count for each column. Just build above query dynamically:

CREATE OR REPLACE FUNCTION get_fill_rates(tbl regclass, fields text[])
  RETURNS TABLE (field_name text, fill_count bigint, fill_percentage numeric) AS
$func$
BEGIN
RETURN QUERY EXECUTE (
-- RAISE NOTICE '%', (  -- to debug if needed
   SELECT
     'SELECT v.*
      FROM  (
         SELECT count(*)::numeric AS ct
              , ' || string_agg(format('count(%I) AS %I', fld, 'ct_' || fld), ', ') || '
         FROM   ' || tbl || '
         ) sub
           , LATERAL (
            VALUES
               (text ' || string_agg(format('%L, %2$I, round(%2$I/ ct, 3))', fld, 'ct_' || fld), ', (') || '
         ) v(field_name, fill_count, fill_pct)
      ORDER  BY v.fill_count DESC'
   FROM  unnest(fields) fld
   );
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM get_fill_rates('my_table', '{column_a, column_b, column_c}');

As you can see, this works for any given table and column list now.
And all identifiers are properly quoted automatically, using format() or by the built-in virtues of the regclass type.

Related:


Your original query could be improved like this, but this is just lipstick on a pig. Do not use this inefficient approach.

CREATE OR REPLACE FUNCTION get_fill_rates()
  RETURNS TABLE (field_name text, fill_count bigint, fill_percentage float) AS
$$
DECLARE
   fields text[] := '{column_a, column_b, column_c}';  -- must be legal identifiers!
   total_rows float;  -- use float right away
BEGIN
   SELECT reltuples INTO total_rows FROM pg_class WHERE relname = 'my_table';

   FOREACH field_name IN ARRAY fields  -- use FOREACH
   LOOP
      EXECUTE 'SELECT COUNT(*) FROM big WHERE ' || field_name || ' IS NOT NULL'
      INTO fill_count;
      fill_percentage := fill_count / total_rows;  -- already type float
      RETURN NEXT;
   END LOOP;
END
$$ LANGUAGE plpgsql;

Plus, pg_class.reltuples is only an estimate. Since you are counting anyway, use an actual count.

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228