2

I have found solutions (I think) to the problem I'm about to ask for on Oracle and SQL Server, but can't seem to translate this into a Postgres solution. I am using Postgres 9.3.6.

The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by having queries run for each column so as to find out, say... min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.

With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:

SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '  
FROM information_schema.columns 
WHERE table_name = 'salesdata'

The advantage being that the db will generate the code regardless of the number of columns. Now there's a myriad places I had in mind for storing these queries, either a variable of some sort, or a table column, the idea being to then have these queries execute. I thought of storing the generated queries in a variable then executing them using the EXECUTE (or EXECUTE IMMEDIATE) statement which is the approach employed here (see right pane), but Postgres won't let me declare a variable outside a function and I've been scratching my head with how this would fit together, whether that's even the direction to follow, perhaps there's something simpler.

Would you have any pointers, I'm currently trying something like this, inspired by this other question but have no idea whether I'm headed in the right direction:

CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
    dyn_sql text; 
BEGIN            
dyn_sql := SELECT 'SELECT min('||column_name||') from salesdata'    
    FROM information_schema.columns 
    WHERE table_name = 'salesdata';
execute dyn_sql
END
$$ LANGUAGE PLPGSQL;    
Community
  • 1
  • 1
EJB
  • 23
  • 1
  • 1
  • 4
  • It's really unclear what you are trying to do. Generate query strings to save them? Execute the queries and save results? Please clarify the task. And, as *always*, provide your version of Postgres, please. – Erwin Brandstetter Apr 13 '15 at 22:34
  • "for profiling purposes" --- what does this exactly mean? – zerkms Apr 13 '15 at 23:00
  • What would you like your function to return? The minimum of each column? – tsnorri Apr 13 '15 at 23:05
  • 2
    There is no `EXECUTE IMMEDIATE` in plpgsql, just `EXECUTE`. – Erwin Brandstetter Apr 14 '15 at 01:24
  • @zerkms: data profiling is a way of getting a good idea of what is in your table, min/max/average values, value frequencies, value distribution, count of nulls, etc... this is a starting point for data quality evaluation. – EJB Apr 14 '15 at 06:07
  • I also answered the question that inspired you: http://stackoverflow.com/a/29616496/939860 – Erwin Brandstetter Apr 14 '15 at 23:51

2 Answers2

6

System statistics

Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:

This view allows access only to rows of pg_statistic that correspond to tables the user has permission to read, and therefore it is safe to allow public read access to this view.

It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.

-- ANALYZE tbl;  -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND   schemaname = 'public';

Generic dynamic plpgsql function

You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.

This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.

CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
   SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
                , string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
                , pg_typeof(_tbl)::text)
   FROM   pg_attribute
   WHERE  attrelid = pg_typeof(_tbl)::text::regclass
   AND    NOT attisdropped  -- no dropped (dead) columns
   AND    attnum > 0        -- no system columns
   );
END
$func$;

Call (important!):

SELECT * FROM f_min_of(NULL::tbl);  -- tbl being the table name

db<>fiddle here
Old sqlfiddle

You need to understand these concepts:

  • Dynamic SQL in plpgsql with EXECUTE
  • Polymorphic types
  • Row types and table types in Postgres
  • How to defend against SQL injection
  • Aggregate functions
  • System catalogs

Related answer with detailed explanation:

Special difficulty with type mismatch

I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.

However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.

That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure, nothing has changed in this area.

That's where this construct comes in as workaround:

SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;

By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.

This might fail for some aggregate function. sum() returns numeric for a sum(bigint_column) to accommodate for a sum overflowing the base data type. Casting back to bigint might fail ...

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

@Erwin Brandstetter, Many thanks for the extensive answer. pg_stats does indeed provide a few things, but what I really need to draw a complete profile is a variety of things, min, max values, counts, count of nulls, mean etc... so a bunch of queries have to be ran for each columns, some with GROUP BY and such.

Also, thanks for highlighting the importance of data types, i was sort of expecting this to throw a spanner in the works at some point, my main concern was with how to automate the query generation, and its execution, this last bit being my main concern.

I have tried the function you provide (I probably will need to start learning some plpgsql) but get a error at the SELECT (t::tbl) :

ERROR: type "tbl" does not exist

btw, what is the (t::abc) notation referred as, in python this would be a list slice, but it’s probably not the case in PLPGSQL

EJB
  • 23
  • 1
  • 1
  • 4
  • Be aware that I am not notified of additional answers. Found this by chance. I am only notified with comments (or answers) to a post of mine or if you @-reply. [Detailed instructions here](http://meta.stackexchange.com/questions/43019/how-do-comment-replies-work) – Erwin Brandstetter Apr 14 '15 at 12:20
  • About the error: I accidentally hard-coded the table name in one place. Consider my update. – Erwin Brandstetter Apr 14 '15 at 12:25
  • Thanks again for the extensive answers, tried your code which now works great. I'll have to try and format in the exact manner i'd like the results presented but it's great for now. – EJB Apr 15 '15 at 08:35