How to pass a set of rows from one function into another?

Question

Overview

I'm using PostgreSQL 9.1.14, and I'm trying to pass the results of a function into another function. The general idea (specifics, with a minimal example, follow) is that we can write:

select * from (select * from foo ...)

and we can abstract the sub-select away in a function and select from it:

create function foos() 
returns setof foo
language sql as $$
  select * from foo ...
$$;

select * from foos()

Is there some way to abstract one level farther, so as to be able to do something like this (I know functions cannot actually have arguments with setof types):

create function more_foos( some_foos setof foo )
language sql as $$
  select * from some_foos ...  -- or unnest(some_foos), or ???
$$:

select * from more_foos(foos())

Minimal Example and Attempted Workarounds

I'm using PostgreSQL 9.1.14. Here's a minimal example:

-- 1. create a table x with three rows                                                                                                                                                            
drop table if exists x cascade;
create table if not exists x (id int, name text);
insert into x values (1,'a'), (2,'b'), (3,'c');

-- 2. xs() is a function with type `setof x`
create or replace function xs()
returns setof x
language sql as $$
  select * from x
$$;

-- 3. xxs() should return the context of x, too
--    Ideally the argument would be a `setof x`,
--    but that's not allowed (see below).
create or replace function xxs(x[])  
returns setof x
language sql as $$
  select x.* from x
  join unnest($1) y
       on x.id = y.id
$$;

When I load up this code, I get the expected output for the table definitions, and I can call and select from xs() as I'd expect. But when I try to pass the result of xs() to xxs(), I get an error that "function xxs(x) does not exist":

db=> \i test.sql 
DROP TABLE
CREATE TABLE
INSERT 0 3
CREATE FUNCTION
CREATE FUNCTION

db=> select * from xs();
  1 | a
  2 | b
  3 | c

db=> select * from xxs(xs());
ERROR:  function xxs(x) does not exist
LINE 1: select * from xxs(xs());
                      ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

I'm a bit confused about "function xxs(x) does not exist"; since the return type of xs() was setof x, I'd expected that its return type would be setof x (or maybe x[]), not x. Following the complaints about the type, I can get to either of the following , but while with either definition I can select xxs(xs());, I can't select * from xxs(xs());.

create or replace function xxs( x )
returns setof x
language sql as $$
  select x.* from x
  join unnest(array[$1]) y    -- unnest(array[...]) seems pretty bad
       on x.id = y.id
$$;

create or replace function xxs( x )
returns setof x
language sql as $$
  select * from x
         where x.id in ($1.id)
$$;

db=> select xxs(xs());
 (1,a)
 (2,b)
 (3,c)

db=> select * from xxs(xs());
ERROR:  set-valued function called in context that cannot accept a set

Summary

What's the right way to pass the results of a set-returning function into another function? (I have noted that create function … xxs( setof x ) … results in the error: ERROR: functions cannot accept set arguments, so the answer won't literally be passing a set of rows from one function to another.)

_What's the right way to pass the results of a set-returning function into another function?_: there is none because functions don't accept resultsets as input. What can lead to think otherwise is `select xxs(xs())` working, but it's a special case, a hack. It actually calls `xs()` once, iterates over the rows, invoking `xss(x)` for each one. Use cursors if you want to pass resultsets. — Daniel Vérité, Nov 15 '14 at 01:23
@DanielVérité I'm starting to worry that that's the case. It seems strange to me that we can: (i) select from subqueries, e.g., `select * from (select ...)`; (ii) write functions that return sets of results from queries, e.g., `create function foo() ... select * from ...`; and (iii) select from the return values of such functions, e.g., `select * from foo()`; but cannot (iv) abstract away one more level and do soime variant of `create function bar(X) ... select from X`. — Joshua Taylor, Nov 15 '14 at 03:09
@DanielVérité In the meantime I'll be looking into cursors, but it seems like an omission [says a newcomer to the language ;)] since in most languages, if you can replace an expression with a call to a function that has the expression as a body, you'd also expect to be able to replace the function call with a variable holding the results from an earlier call to the function. But as I say, I'm a bit of a newcomer here, and so I don't have anywhere near the perspective that others do; there's probably some good justification that I don't see yet. — Joshua Taylor, Nov 15 '14 at 03:12
The reason why you received "function xxs(x) does not exist", is because databases tend to support function overloading, where you can have multiple functions with the same function name, but each with a different set of parameters. Hence, at runtime when it tries to find your function, it uses BOTH the function name and the parameters to find the correct copy of the function. Function xxs(x) obviously existed, but not one that accepts a result set as input. — Mike Jones, Nov 18 '14 at 17:18
@MikeJones The `x` in `xxs(x)` is the *type* of the argument (since `x` is the name of a table). The oddity is that while `xs()` returns a `setof x`, the system is looking for a function that accepts a single `x`. The reason for that, as DanielVerite explained above, is that xxs(xs()) is actually calling xs(), getting the results (a setof x) back and trying to call xxs(...) on each one, which means that xxs(x) is needed. — Joshua Taylor, Nov 18 '14 at 17:38

score 8 · Answer 1 · edited Nov 24 '14 at 16:43

Table functions

I perform very high speed, complex database migrations for a living, using SQL as both the client and server language (no other language is used), all running server side, where the code rarely surfaces from the database engine. Table functions play a HUGE role in my work. I don't use "cursors" since they are too slow to meet my performance requirements, and everything I do is result set oriented. Table functions have been an immense help to me in completely eliminating use of cursors, achieving very high speed, and have contributed dramatically towards reducing code volume and improving simplicity.

In short, you use a query that references two (or more) table functions to pass the data from one table function to the next. The select query result set that calls the table functions serves as the conduit to pass the data from one table function to the next. On the DB2 platform / version I work on, and it appears based on a quick look at the 9.1 Postgres manual that the same is true there, you can only pass a single row of column values as input to any of the table function calls, as you've discovered. However, because the table function call happens in the middle of a query's result set processing, you achieve the same effect of passing a whole result set to each table function call, albeit, in the database engine plumbing, the data is passed only one row at a time to each table function.

Table functions accept one row of input columns, and return a single result set back into the calling query (i.e. select) that called the function. The result set columns passed back from a table function become part of the calling query's result set, and are therefore available as input to the next table function, referenced later in the same query, typically as a subsequent join. The first table function's result columns are fed as input (one row at a time) to the second table function, which returns its result set columns into the calling query's result set. Both the first and second table function result set columns are now part of the calling query's result set, and are now available as input (one row at a time) to a third table function. Each table function call widens the calling query's result set via the columns it returns. This can go on an on until you start hitting limits on the width of a result set, which likely varies from one database engine to the next.

Consider this example (which may not match Postgres' syntax requirements or capabilities as I work on DB2). This is one of many design patterns in which I use table functions, is one of the simpler ones, that I think is very illustrative, and one that I anticipate would have broad appeal if table functions were in heavy mainstream use (to my knowledge they are not, but I think they deserve more attention than they are getting).

In this example, the table functions in use are: VALIDATE_TODAYS_ORDER_BATCH, POST_TODAYS_ORDER_BATCH, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH. On the DB2 version I work on, you wrap the table function inside "TABLE( place table function call and parameters here )", but based on quick look at a Postgres manual it appears you omit the "TABLE( )" wrapper.

create table TODAYS_ORDER_PROCESSING_EXCEPTIONS as (

select      TODAYS_ORDER_BATCH.*
           ,VALIDATION_RESULT.ROW_VALID
           ,POST_RESULT.ROW_POSTED
           ,WAREHOUSE_RESULT.ROW_WAREHOUSED

from        TODAYS_ORDER_BATCH

cross join  VALIDATE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function]  ) 
              as VALIDATION_RESULT ( ROW_VALID )  --example: 1/0 true/false Boolean returned

left join   POST_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] )
              as POST_RESULT ( ROW_POSTED )  --example: 1/0 true/false Boolean returned
      on    ROW_VALIDATED = '1'

left join   DATA_WAREHOUSE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] )
              as WAREHOUSE_RESULT ( ROW_WAREHOUSED )  --example: 1/0 true/false Boolean returned
      on    ROW_POSTED = '1'

where       coalesce( ROW_VALID,      '0' ) = '0'   --Capture only exceptions and unprocessed work.  
      or    coalesce( ROW_POSTED,     '0' ) = '0'   --Or, you can flip the logic to capture only successful rows.
      or    coalesce( ROW_WAREHOUSED, '0' ) = '0'

) with data

If table TODAYS_ORDER_BATCH contains 1,000,000 rows, then VALIDATE_TODAYS_ORDER_BATCH will be called 1,000,000 times, once for each row.
If 900,000 rows pass validation inside VALIDATE_TODAYS_ORDER_BATCH, then POST_TODAYS_ORDER_BATCH will be called 900,000 times.
If only 850,000 rows successfully post, then VALIDATE_TODAYS_ORDER_BATCH needs some loopholes closed LOL, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH will be called 850,000 times.
If 850,000 rows successfully made it into the Data Warehouse (i.e. no additional exceptions were generated), then table TODAYS_ORDER_PROCESSING_EXCEPTIONS will be populated with 1,000,000 - 850,000 = 150,000 exception rows.

The table function calls in this example are only returning a single column, but they could be returning many columns. For example, the table function validating an order row could return the reason why an order failed validation.

In this design, virtually all the chatter between a HLL and the database is eliminated, since the HLL requestor is asking the database to process the whole batch in ONE request. This results in a reduction of millions of SQL requests to the database, in a HUGE removal of millions of HLL procedure or method calls, and as a result provides a HUGE runtime improvement. In contrast, legacy code which often processes a single row at a time, would typically send 1,000,000 fetch SQL requests, 1 for each row in TODAYS_ORDER_BATCH, plus at least 1,000,000 HLL and/or SQL requests for validation purposes, plus at least 1,000,000 HLL and/or SQL requests for posting purposes, plus 1,000,000 HLL and/or SQL requests for sending the order to the data warehouse. Granted, using this table function design, inside the table functions SQL requests are being sent to the database, but when the database makes requests to itself (i.e from inside a table function), the SQL requests are serviced much faster (especially in comparison to a legacy scenario where the HLL requestor is doing single row processing from a remote system, with the worst case over a WAN - OMG please don't do that).

You can easily run into performance problems if you use a table function to "fetch a result set" and then join that result set to other tables. In that case, the SQL optimizer can't predict what set of rows will be returned from the table function, and therefore it can't optimize the join to subsequent tables. For that reason, I rarely use them for fetching a result set, unless I know that result set will be a very small number of rows, hence not causing a performance problem, or I don't need to join to subsequent tables.

In my opinion, one reason why table functions are underutilized is that they are often perceived as only a tool to fetch a result set, which often performs poorly, so they get written off as a "poor" tool to use.

Table functions are immensely useful for pushing more functionality over to the server, for eliminating most of the chatter between the database server and programs on remote systems, and even for eliminating chatter between the database server and external programs on the same server. Even chatter between programs on the same server carries more overhead than many people realize, and much of it is unnecessary. The heart of the power of table functions lies in using them to perform actions inside result set processing.

There are more advanced design patterns for using table functions that build on the above pattern, where you can maximize result set processing even further, but this post is a lot for most to absorb already.

+1 As this is a pretty nice writeup. I'm comfortable using multiple result functions and joining them, but the use case that I'm aiming for is along the lines of function f accepts a set of X rows and returns a set of Y rows based on them. f doesn't really care which X rows it gets, so it'd be nice to abstract that part away and let them be passed into the function. That way we can have `f(best_Xes())`, `f(worst_Xes())`, `f(Xes_of_SpecialInterest())`, etc. Now, I'm starting to think that the plumbing you mention is why this is disallowed; in order to optimize this kind of thing well, it's — Joshua Taylor, Nov 18 '14 at 13:33
important to be able to expand the definition of the function in place, so the optimizers really don't want a set of rows arriving at runtime, but want a call for which they can expand a query. Is that the root cause here? — Joshua Taylor, Nov 18 '14 at 13:34
I wish a table function could accept a whole result set, especially a dynamic result set, as input like you wish. For DB2 (version I'm on) and Postgres 9.1, such is not the case. I don't pretend to know all databases. If there is a database out there that supports that, it is a ROCK STAR! In order to handle what you want, I think the engineers creating the database system would have to generate a lot of code at runtime, and deal with a lot of complexity. The net result would be very difficult to optimize, and therefore not received well in the developer community. — Mike Jones, Nov 18 '14 at 17:24

How to pass a set of rows from one function into another?

Overview

Minimal Example and Attempted Workarounds

Summary

1 Answers1

Table functions

Linked