1

I am very much new to SQL and have been given an exercise for homework that sounds when translated like this: "For each column type find the number of tables that do not contain this type of column". I know how to find count of tables that contain at least one column type:

SELECT Data_Type, COUNT(Table_schema || '.' || table_name)
FROM Information_schema.Columns
GROUP BY Data_Type;

But trying to figure out how to do the opposite of this has left me scratching my head for hours. I have tried to use WHERE:

WHERE Data_Type IS NULL

But it gives same results as the query without this line. I have seen solutions when counting tables with specifically named columns (using 'join' and 'not in') but I don't think this would work for this task. I think NOT EXISTS would probably work but not sure how to implement it. Thanks in advance!

  • 1
    In your query, the expression `COUNT(Table_schema || '.' || table_name)` counts the number of columns, not the number of tables. How can this be changed so that tables aren't counted more than once? Don't over think the solution to the original problem. Restated, it's asking you to find for each data type, the count of all tables minus the count of tables that have one or more columns of that data type. – JohnH Apr 18 '23 at 00:36
  • Thanks for the tip! This is the part I got snagged on without even knowing it! – ConfusedByRegex Apr 18 '23 at 07:22

3 Answers3

1

You can do something like this:

with dtypes as (
    select distinct Data_Type
    from Information_schema.Columns
  )
select
  data_type,
  (
    select count(*)
    from Information_schema.tables t
    where not exists (
        select 1
        FROM Information_schema.Columns c2
        WHERE
          c2.Table_schema = t.Table_schema
          and c2.Table_name = t.Table_name
          and c2.Data_Type = d.Data_Type
      )
    AND t.table_type = 'BASE TABLE'
    -- AND t.Table_schema = 'MYSCHEMA' -- in case you want to count tables only from single schema.
  )
from dtypes d

Here CTE dtypes contains all types present in database as column types.

Then for every type we find number of tables, that do not have at least one column of this type.

Demo here

markalex
  • 8,623
  • 2
  • 7
  • 32
  • Besides regular tables, this query will also include views, foreign tables, and local temporary tables. – JohnH Apr 18 '23 at 03:42
  • @JohnH, fair point. Corrected. – markalex Apr 18 '23 at 03:50
  • Thanks! Never thought to start with with. All our previous assignments were simple in that regard so it was confusing to what to start with properly – ConfusedByRegex Apr 18 '23 at 07:21
  • @ConfusedByRegex, CTE here is not required. It used just for readability sake. Main difficulty in this task is that you required 1. dynamically get list of types, 2. list of tables, 3. Filter list of tables by types not used in it. Hence three sub-query are used. – markalex Apr 18 '23 at 07:26
  • Is there a way to avoid nested queries in this? We were taught to avoid them, but I can't really see how to walk around this solution, not while using WHERE NOT EXISTS. – ConfusedByRegex Apr 18 '23 at 08:21
  • @ConfusedByRegex, you can rewrite it using left join, group by and having, but it wouldn't be better performance-wise or clear to read. I believe this is exactly the case where sub-query are acceptable (and even preferred) – markalex Apr 18 '23 at 08:27
  • @ConfusedByRegex, saying that "subqueries should be avoided" is just another way of saying that they should only be used when ncessary. Often aggregate and window functions can be employed instead of subqueries to achieve the same results more efficiently. – JohnH Apr 18 '23 at 15:52
1

Here is another approach to solving the problem:

WITH tc AS (
  SELECT
    t.table_schema,
    t.table_name,
    COUNT(*) OVER (PARTITION BY NULL) AS total_table_cnt
  FROM
    information_schema.tables t
  WHERE
    t.table_type = 'BASE TABLE'
)
SELECT
  c.data_type,
  MAX(tc.total_table_cnt) - COUNT(DISTINCT (c.table_schema, c.table_name)) AS table_cnt
FROM
  tc
  JOIN information_schema.columns c
    ON (tc.table_schema, tc.table_name) = (c.table_schema, c.table_name)
GROUP BY
  c.data_type
ORDER BY
  c.data_type;

This isn't presented with any claims of superiority over other approaches, but just as a demonstration of additional techniques.

The query uses a common table expression (CTE) to determine the set of tables. The expression COUNT(*) OVER (PARTITION BY NULL) gets a count of the members at the same time, thus avoiding an additional subquery.

In the main query, MAX(tc.total_table_cnt) wraps the total table count in an aggregate function so that it's value can be used in a grouped query. MIN could also have been used since the value is the same for all rows in tc.

COUNT(DISTINCT (c.table_schema, c.table_name)) gets the count of tables having at least one column of the group's data type.

Row constructors, (tc.table_schema, tc.table_name) = (c.table_schema, c.table_name), are used for the join criteria. There's no significant advantage in this case over using separate equality comparisons for each column, but row constructors can be very useful when multiple columns are needed with the IN operator or subqueries.

JohnH
  • 2,001
  • 1
  • 2
  • 13
0

The question really needs to be defined more closely.

  • Which tables and which types to include?
  • Include internal data types and system tables?
  • Is varchar(25) distinct from varchar?
  • What about partitioned tables?
  • Materialized views?
  • Generated columns?
  • User types? ...

In my hands, this query is around 50x faster than proposed queries based on the information schema:

SELECT col.atttypid::regtype AS data_type
     , tbl.tbls_total - col.tbls_with_type AS tbls_without_that_type
FROM  (
   SELECT a.atttypid::regtype, count(DISTINCT a.attrelid) AS tbls_with_type
   FROM   pg_attribute a
   JOIN   pg_class c ON c.oid = a.attrelid
   WHERE  c.relnamespace = 'public'::regnamespace  -- only user schema(s)
   AND    c.relkind = 'r'
   AND    a.attnum > 0
   AND    NOT a.attisdropped
   GROUP  BY 1
   ) col
CROSS JOIN (
   SELECT  count(*) AS tbls_total
   FROM    pg_class c
   WHERE   c.relnamespace = 'public'::regnamespace  -- only user schema(s)
   AND     relkind = 'r'
   ) tbl
ORDER  BY 2 DESC, 1

The basic approach is to count existing tables for each data type, and then subtract that from the total count of tables. Simpler.

I excluded system columns and limited the query to the public schema. Adapt to your needs.

There are various differences!

Notably, the query on the information_schema lumps all "ARRAY" types together, while my query treats text[], int[], etc. as distinct types. (Much more sensible IMO.)

Also, the information schema does not include partitioned tables (properly). You may or may not want that.

About information schema vs. system catalogs:

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Using ordinals in `GROUP BY` and `ORDER BY` clauses is a poor practice that makes code more difficult to understand and increases the likelihood that someone will unintentionally alter a query's behavior. – JohnH Apr 18 '23 at 21:18
  • @JohnH: Ordinal numbers are standard SQL, with some advantages and disadvantages. Your opinion is duly noted, but I don't subscribe to that general statement. – Erwin Brandstetter Apr 18 '23 at 21:27
  • Other than brevity, I'd be interested to learn of even one advantage to using ordinals. Just because something is in a standard, that doesn't make it good practice. – JohnH Apr 18 '23 at 21:58
  • @JohnH The SQL standard is a bit wobbly what it accepts in `GROUP BY` and `ORDER BY`. Input column names, output column names, expressions? See: https://www.postgresql.org/docs/current/sql-select.html#id-1.9.3.172.10.9 The ordinal number has the same scope and meaning in all versions of Postgres. I have seen my share of confusion stemming from this. Like https://stackoverflow.com/a/19849537/939860, https://stackoverflow.com/a/8119815/939860, https://stackoverflow.com/a/11218453/939860, https://stackoverflow.com/a/40538650/939860 etc. – Erwin Brandstetter Apr 18 '23 at 22:17
  • @JohnH: Ordinal numbers can also help to avoid naming collisions with variables and parameters in server-side functions. Or with dynamic SQL. Or when you really want to order by the Nth column in the `SELECT` list. And yes, brevity matters. Replacing lengthy expressions translates to improved readability for me. That said, plain column names are typically clearer, especially for beginners. – Erwin Brandstetter Apr 18 '23 at 22:18