5

I'm having a Postgresql (version 9.4) performance puzzle. I have a function (prevd) declared as STABLE (see below). When I run this function on a constant in where clause, it is called multiple times - instead of once. If I understand postgres documentation correctly, the query should be optimized to call prevd only once.

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement

Why it doesn't optimize calls to prevd in this case? I'm not expecting prevd to be called once for all subsequent queries using prevd on the same argument (like it was IMMUTABLE). I'm expecting postgres to create a plan for my query with just one call to prevd('2015-12-12')

Please find the code below:

Schema

create table somedata(d date, number double precision);
create table dates(d date);

insert into dates
select generate_series::date
from   generate_series('2015-01-01'::date, '2015-12-31'::date, '1 day');

insert into somedata
select '2015-01-01'::date + (random() * 365 + 1)::integer, random()
from   generate_series(1, 100000);

create or replace function prevd(date_ date)
returns date
language sql
stable
as $$
  select max(d) from dates where d < date_;
$$

Slow Query

select avg(number) from somedata where d=prevd('2015-12-12');

Poor query plan of the query above

 Aggregate  (cost=28092.74..28092.75 rows=1 width=8) (actual time=3532.638..3532.638 rows=1 loops=1)
   Output: avg(number)
   ->  Seq Scan on public.somedata  (cost=0.00..28091.43 rows=525 width=8) (actual time=10.210..3532.576 rows=282 loops=1)
         Output: d, number
         Filter: (somedata.d = prevd('2015-12-12'::date))
         Rows Removed by Filter: 99718
 Planning time: 1.144 ms
 Execution time: 3532.688 ms
(8 rows)

Performance

The query above, on my machine runs around 3.5s. After changing prevd to IMMUTABLE, it's changing to 0.035s.

Marcin Krupowicz
  • 536
  • 6
  • 16
  • 1
    Possible duplicate of [Why is PostgreSQL calling my STABLE/IMMUTABLE function multiple times?](http://stackoverflow.com/questions/8529690/why-is-postgresql-calling-my-stable-immutable-function-multiple-times) - the test case in the question is different, but the answer covers your case as well. – IMSoP Feb 02 '16 at 18:23
  • My query calls STABLE function on a constant. Query from the mentioned question is calling IMMUTABLE function on non-costant value (at least not directly) – Marcin Krupowicz Feb 02 '16 at 18:34
  • Take a look at the *answer* on the linked question, rather than just the question itself. It covers just about every combination of stable/immutable and constant/non-constant argument, including one exactly like yours: "With `where test_multi_calls1(30) != num` query re-writing will happen for immutable but not for merely stable functions". – IMSoP Feb 02 '16 at 23:31
  • I don't know "why", but you can use `select avg(number) from somedata where d=(select prevd(date '2015-12-12'));` - it seems that the subquery forces the optimizer to "cache" or "materialize" a function resut,in the memory, and the query is fast. – krokodilko Feb 02 '16 at 23:34
  • @IMSoP: still, it doesn't answer why it is not optimized - like the query in the comment above (@kordirko). From the docs: "[...] Because of this snapshotting behavior, a function containing only SELECT commands can safely be marked STABLE, even if it selects from tables that might be undergoing modifications by concurrent queries. PostgreSQL will execute all commands of a STABLE function using the snapshot established for the calling query, and so it will see a fixed view of the database throughout that query.". I don't understand why it can't work like the optimization with subquery. – Marcin Krupowicz Feb 03 '16 at 10:31

1 Answers1

4

I started writing this as a comment, but it got a bit long, so I'm expanding it into an answer.

As discussed in this previous answer, Postgres does not promise to always optimise based on STABLE or IMMUTABLE annotations, only that it can sometimes do so. It does this by planning the query differently by taking advantage of certain assumptions. This part of the previous answer is directly analogous to your case:

This particular sort of rewriting depends upon immutability or stability. With where test_multi_calls1(30) != num query re-writing will happen for immutable but not for merely stable functions.

If you change the function to IMMUTABLE and look at the query plan, you will see that the rewriting it does is really rather radical:

Seq Scan on public.somedata  (cost=0.00..1791.00 rows=272 width=12) (actual time=0.036..14.549 rows=270 loops=1)
  Output: d, number
  Filter: (somedata.d = '2015-12-11'::date)
  Buffers: shared read=541 written=14
Total runtime: 14.589 ms

It actually runs the function while planning the query, and substitutes the value before the query is even executed. With a STABLE function, this optimisation would clearly not be appropriate - the data might change between planning and executing the query.

In a comment, it was mentioned that this query results in an optimised plan:

select avg(number) from somedata where d=(select prevd(date '2015-12-12'));

This is fast, but note that the plan doesn't look anything like what the IMMUTABLE version did:

Aggregate  (cost=1791.69..1791.70 rows=1 width=8) (actual time=14.670..14.670 rows=1 loops=1)
  Output: avg(number)
  Buffers: shared read=541 written=21
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
          Output: '2015-12-11'::date
  ->  Seq Scan on public.somedata  (cost=0.00..1791.00 rows=273 width=8) (actual time=0.026..14.589 rows=270 loops=1)
        Output: d, number
        Filter: (somedata.d = $0)
        Buffers: shared read=541 written=21
Total runtime: 14.707 ms

By putting it into a sub-query, you are moving the function call from the WHERE clause to the SELECT clause. More importantly, the sub-query can always be executed once and used by the rest of the query; so the function is run once in a separate node of the plan.

To confirm this, we can take the SQL out of a function altogether:

select avg(number) from somedata where d=(select max(d) from dates where d <  '2015-12-12');

This gives a rather longer plan with very similar performance:

Aggregate  (cost=1799.12..1799.13 rows=1 width=8) (actual time=14.174..14.174 rows=1 loops=1)
  Output: avg(somedata.number)
  Buffers: shared read=543 written=19
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=7.43..7.44 rows=1 width=4) (actual time=0.150..0.150 rows=1 loops=1)
          Output: max(dates.d)
          Buffers: shared read=2
          ->  Seq Scan on public.dates  (cost=0.00..6.56 rows=347 width=4) (actual time=0.015..0.103 rows=345 loops=1)
                Output: dates.d
                Filter: (dates.d < '2015-12-12'::date)
                Buffers: shared read=2
  ->  Seq Scan on public.somedata  (cost=0.00..1791.00 rows=273 width=8) (actual time=0.190..14.098 rows=270 loops=1)
        Output: somedata.d, somedata.number
        Filter: (somedata.d = $0)
        Buffers: shared read=543 written=19
Total runtime: 14.232 ms

The important thing to note is that the inner Aggregate (the max(d)) is executed once, on a separate node from the main Seq Scan (which is checking the where clause). In this position, even a VOLATILE function can be optimised in the same way.

In short, while you know that the query you've produced can be optimised by executing the function only once, it doesn't match any of the patterns that Postgres's query planner knows how to rewrite, so it uses a naive plan which runs the function multiple times.

[Note: all tests performed on Postgres 9.1, because it's what I happened to have to hand.]

Community
  • 1
  • 1
IMSoP
  • 89,526
  • 13
  • 117
  • 169
  • Thanks! It's very surprising to me. I was expecting that Postgres in general expands function calls to the function's body (and effectively runs a query wich subquery in this case). At least functions which are IMMUTABLE and STABLE, since they can't modify the database. – Marcin Krupowicz Feb 03 '16 at 13:44
  • Would you say that using function calls in joins (select * from a join b(...) on .. join c(...) on ...) will result in poorer query plan than query written with substituted b(..) and c(..) with their bodies? – Marcin Krupowicz Feb 03 '16 at 13:51
  • @MarcinKrupowicz Remember that functions in Postgres aren't necessarily written in SQL; mostly, they'll be in pl/pgsql, but you can plug in other languages, like Perl etc. So without specifically optimising functions where the language is set to "sql", the default behaviour has to be "execute the function whenever you need its value". There *could* be more special cases built in, but the more complexity you add, the more risk of bugs or adverse side-effects. – IMSoP Feb 03 '16 at 14:25
  • @MarcinKrupowicz As for joins, I think it would very much depend. Ultimately, the best thing to do is to build a couple of versions of the query and use `EXPLAIN ANALYZE` to see what the query planner comes up with. – IMSoP Feb 03 '16 at 14:27