Apply the same aggregate to every column in a table

Question

I am using a proprietary mpp database that has been forked off psql 8.3. I am trying to apply a simple count to a wide table (around 450 columns) and so I was wondering if the best way to do this in terms of a simple sql function. I am just counting the number of distinct values in a given column as well as the count of the number of null values in the column. The query i want to generalize for every column is for example

If i want to run the query against the column names i write

select
count(distinct names) d_names,
sum(case when names is not null then 1 else 0 end) n_s_ip
from table;

How do i generalize the query above to iterate through every column in the table if the number of columns is 450 without writing out each column name by hand?

score 3 · Accepted Answer · edited May 23 '17 at 12:30

3

First, since COUNT() only counts non-null values, your query can be simplified:

SELECT count(DISTINCT names) AS unique_names
      ,count(names) AS names_not_null
FROM   table;

But that's the number of non-null values and contradicts your description:

count of the number of null values in the column

For that you would use:

count(*) - count(names) AS names_null

Since count(*) count all rows and count(names) only rows with non-null names.
_{Removed inferior alternative after hint by @Andriy.}

To automate that for all columns build an SQL statement off of the catalog table pg_attribute dynamically. You can use EXECUTE in a PL/pgSQL function to execute it immediately. Find full code examples with links to the manual and explanation under these closely related questions:

edited May 23 '17 at 12:30

Community

1
1

answered Jan 14 '13 at 06:14

Erwin Brandstetter

605,456
145
1,078
1,228

1

*‘…Or faster with only one scan: …’* – what is that supposed to mean? Would having both `count(*)` and `count(name)` in the same statement require two scans to get the results? – Andriy M Jan 15 '13 at 18:57
@AndriyM: Good question. It's done in *one* scan either way. I didn't think this through properly. After running a few tests: the alternative expression is also slightly slower (~ 1%), so I removed the noise. – Erwin Brandstetter Jan 16 '13 at 00:04
Oh, wow, thanks for sharing your observation, that's a great thing to keep in mind! – Andriy M Jan 16 '13 at 06:19

score 3 · Answer 2 · answered Jan 14 '13 at 06:18

You can generate the repetitive part of query by using information_scheam.columns.

select 'count(distinct '||column_name||') d_names, sum(case when '||column_name||' is not null then 1 else 0 end) n_s_ip,' 
from information_schema.columns where table_name='table'
order by ordinal_position;

The above query will generate count(...) and sum(...) for each column of table. This result can be used as select-list for your query. You can cut&paste the result to the following query:

select 
-- paste here
from table;

After paste, you have to remove the last comma.

In this way, you can avoid writing select-list for 450 columns.

Apply the same aggregate to every column in a table

2 Answers2