3

I am using a proprietary mpp database that has been forked off psql 8.3. I am trying to apply a simple count to a wide table (around 450 columns) and so I was wondering if the best way to do this in terms of a simple sql function. I am just counting the number of distinct values in a given column as well as the count of the number of null values in the column. The query i want to generalize for every column is for example

If i want to run the query against the column names i write

select
count(distinct names) d_names,
sum(case when names is not null then 1 else 0 end) n_s_ip
from table; 

How do i generalize the query above to iterate through every column in the table if the number of columns is 450 without writing out each column name by hand?

laxonline
  • 2,657
  • 1
  • 20
  • 37
user7980
  • 703
  • 3
  • 15
  • 28

2 Answers2

3

First, since COUNT() only counts non-null values, your query can be simplified:

SELECT count(DISTINCT names) AS unique_names
      ,count(names) AS names_not_null
FROM   table;

But that's the number of non-null values and contradicts your description:

count of the number of null values in the column

For that you would use:

count(*) - count(names) AS names_null

Since count(*) count all rows and count(names) only rows with non-null names.
Removed inferior alternative after hint by @Andriy.

To automate that for all columns build an SQL statement off of the catalog table pg_attribute dynamically. You can use EXECUTE in a PL/pgSQL function to execute it immediately. Find full code examples with links to the manual and explanation under these closely related questions:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    *‘…Or faster with only one scan: …’* – what is that supposed to mean? Would having both `count(*)` and `count(name)` in the same statement require two scans to get the results? – Andriy M Jan 15 '13 at 18:57
  • @AndriyM: Good question. It's done in *one* scan either way. I didn't think this through properly. After running a few tests: the alternative expression is also slightly slower (~ 1%), so I removed the noise. – Erwin Brandstetter Jan 16 '13 at 00:04
  • Oh, wow, thanks for sharing your observation, that's a great thing to keep in mind! – Andriy M Jan 16 '13 at 06:19
3

You can generate the repetitive part of query by using information_scheam.columns.

select 'count(distinct '||column_name||') d_names, sum(case when '||column_name||' is not null then 1 else 0 end) n_s_ip,' 
from information_schema.columns where table_name='table'
order by ordinal_position;

The above query will generate count(...) and sum(...) for each column of table. This result can be used as select-list for your query. You can cut&paste the result to the following query:

select 
-- paste here
from table;

After paste, you have to remove the last comma.

In this way, you can avoid writing select-list for 450 columns.

ntalbs
  • 28,700
  • 8
  • 66
  • 83