
I need some help with a PL/pgSQL function in Postgres. This question has been asked on pgsql-general without a conclusive answer (here and here).

It boils down to this:

How do I create a PL/pgSQL function that takes a set of rows as input and returns a set of rows as output, without using arrays (for performance)?

I currently pass the input as a cursor reference, but that is not satisfactory: it forces the same query (SELECT ...hard_work...) to be computed twice, AND it messes up transactions in my application.

Currently it works like this:

DECLARE my_cursor CURSOR FOR ...hard_work...;

WITH first_query AS (
SELECT ...hard_work... -- second computation of hard_work
   ),
second_query AS ( 
    SELECT ...another_query_using_hard_work...
   )
SELECT *
FROM my_function('my_cursor'::refcursor) f(...);
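
For context, my_function consumes the cursor roughly like this (a minimal sketch; the row shape (id, val) is hypothetical):

CREATE OR REPLACE FUNCTION my_function(cur refcursor)
  RETURNS TABLE (id int, val float8) AS
$$
BEGIN
   LOOP
      FETCH cur INTO id, val;   -- read the externally opened cursor
      EXIT WHEN NOT FOUND;
      -- ... per-row processing ...
      RETURN NEXT;
   END LOOP;
END
$$ LANGUAGE plpgsql;

The cursor must be declared, passed, and fully read within one transaction, which is the transaction coupling I complain about above.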

Ultimately I would like something like this (not working):

WITH first_query AS ( 
    SELECT ...hard_work...
   ),
second_query AS ( 
    SELECT ...another_query_using_the_hard_work_query...
   )
SELECT *
FROM my_function('first_query'::regclass) f(...);

Of course, SELECT ...hard_work... is expensive (around 50 ms), so it would be better not to compute it twice. The application streams data, so time is precious, and the data are heavy (typically tens of MB), so copying them into a temp table may be worse than computing twice.

Other solutions have been proposed:

  • passing a view or table reference as input, which has the same problem as using a cursor (i.e. computing twice, in a separate statement)

  • passing/returning arrays: forces a lot of extra computing with array_agg() and unnest() (see the sketch after this list)

  • passing a temp table: involves copying the data, which might take longer than computing twice (I'm fighting for 50 ms)

  • passing a materialized view: only available in 9.3
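
For illustration, a runnable toy version of the array round trip (the type, function, and data are stand-ins, not my real rows):

CREATE TYPE point_row AS (id int, val float8);

CREATE OR REPLACE FUNCTION process_rows(rows point_row[])
  RETURNS SETOF point_row AS
$$
   SELECT * FROM unnest(rows);   -- unpack; real per-row work would go here
$$ LANGUAGE sql;

WITH first_query AS (
   SELECT g AS id, random() AS val   -- stand-in for ...hard_work...
   FROM   generate_series(1, 10) g
)
SELECT f.*
FROM   process_rows((SELECT array_agg((id, val)::point_row)
                     FROM   first_query)) f;

hard_work runs only once here, but array_agg() and unnest() materialize and copy every row, which is exactly the overhead I want to avoid.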

I would be very grateful for any advanced insight on the subject.

Best Regards,

Rémi-C

PS: the links to the questions on the pgsql-general mailing list contain lots of details about the specifics, and even all the code.

PPS: version: Postgres 9.2. OS: Ubuntu 12.04 LTS. Client: pgAdmin III for testing, Node.js for production.

Remi-C
    Resultsets/tables are not first class objects in PostgreSQL. Usually this is worked around by passing the results around in a temporary table with a hardcoded name. – Quassnoi Sep 18 '13 at 13:47
  • You have posted the same question three times without actually giving details of the queries involved. All you can expect is vague generalities. – Richard Huxton Sep 18 '13 at 22:07
  • Hello @Richard Huxton, thanks for answering. I understand it's hard to answer without all the specifics, but please note I'm not asking for help optimising the query, only for a general way of not computing it twice. The query does spatial and temporal filtering on a 12 GB, 1-million-row table, with indexes and all. My project is about storing massive point clouds in Postgres ([github here](https://github.com/remi-c)). Again, this query takes about 50 ms. – Remi-C Sep 19 '13 at 07:34
  • You'd probably have to use an array internally, but you could use an aggregate or window function to do it. You can't just pass a set in; it has to go row by row. But with aggregates, you can use whatever internal format you want. – Chris Travers Nov 16 '13 at 11:02

2 Answers


As far as I can see, you have not tried temporary tables yet.

You can do this with a hardcoded name or even use dynamic names, possibly utilizing a SEQUENCE to get unique table names. You would then use dynamic SQL with EXECUTE inside a PL/pgSQL function and pass the table name (or even better: the object identifier type regclass) to it.

Be sure to run ANALYZE manually on bigger temporary tables right after major changes (usually right after filling them initially), since temporary tables are not visible to the autovacuum daemon.
Possibly even create indexes on big temp tables!
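
A minimal sketch of that pattern, assuming a hypothetical row shape (id, val) and using generate_series as a stand-in for your expensive query:

CREATE OR REPLACE FUNCTION process_tbl(_tbl regclass)
  RETURNS TABLE (id int, val float8) AS
$$
BEGIN
   -- a regclass value renders as a properly quoted, schema-qualified
   -- name, so concatenating it into dynamic SQL is safe
   RETURN QUERY EXECUTE
      'SELECT id, val FROM ' || _tbl;
END
$$ LANGUAGE plpgsql;

-- run the expensive query exactly once, into a temp table
CREATE TEMP TABLE tmp_hard_work AS
SELECT g AS id, random() AS val   -- stand-in for the real query
FROM   generate_series(1, 10) g;

ANALYZE tmp_hard_work;            -- temp tables are not autovacuumed

SELECT * FROM process_tbl('tmp_hard_work');

The temp table survives for the rest of the session (or until ON COMMIT DROP), so any number of later statements can reuse the result without recomputing it.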

You can find many code examples here on Stack Overflow. One particularly rich, loosely related answer:
Refactor a PL/pgSQL function to return the output of various SELECT queries

Erwin Brandstetter
  • Thanks for your answer (@Quassnoi too). I edited the question to be more precise about this solution. Using a temp table would involve a data copy, which I guess would take longer than 50 ms. (EDIT: tried it: it takes at least 100 ms.) – Remi-C Sep 18 '13 at 14:50
  • @Remi-C: "data copy" means copying to *RAM* in this case. This step seems inevitable if you want to do *anything* with your data ... – Erwin Brandstetter Sep 18 '13 at 16:48
  • Yes, you are right, I expect everything to be in RAM (big server), but it still takes 100 ms just to create this temporary table. Maybe overhead from WAL logging and such? Cheers, Rémi-C – Remi-C Sep 19 '13 at 07:36
  • @Remi-C: Temporary tables are not WAL-logged. But the table is registered in the catalog tables, which means a tiny write operation on creation and at the end of the session. – Erwin Brandstetter Sep 24 '13 at 19:59

I think the typical way to solve something like this would be with a custom aggregate/function. Then you can use whatever internal storage you want. Each row gets passed in directly, transformed, and then processed later.
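
A toy sketch of that idea (all names are hypothetical, and the "processing" is just a running sum to show the mechanics):

CREATE OR REPLACE FUNCTION my_trans(state float8, val float8)
  RETURNS float8 AS
$$
   SELECT state + val;   -- per-row accumulation into the internal state
$$ LANGUAGE sql;

CREATE AGGREGATE my_process(float8) (
   SFUNC    = my_trans,
   STYPE    = float8,
   INITCOND = '0'
);

WITH first_query AS (
   SELECT random() AS val   -- stand-in for the expensive query
   FROM   generate_series(1, 10)
)
SELECT my_process(val) FROM first_query;   -- the source query runs once

The state type (STYPE) can be anything, including a composite or an array, so the per-row transformation can keep whatever it needs without an explicit array_agg()/unnest() round trip in the calling query.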

Chris Travers