0

I want to have a variable cached during a query performing on Postgres 12. I followed the approach of having CTE like below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

Running the above query leads to performance issues. I expected the total runtime to be approximately (part 1 runtime + part 2 runtime), but it takes a lot longer.

Notably, there is no performance issue when I run only the 2nd part with a manually supplied datetime_threshold.

The locations table is defined as:

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------

Is there any way to reduce the total runtime to something like (part1 runtime + part2 runtime)?

partizaans
  • I don't understand your question. Your two "performance estimates" are the same equation. – Gordon Linoff Mar 14 '21 at 14:40
  • @partizaans have you considered using a temporary table to hold the date from the first query? You can give it a try. Please look at my answer. – Kazi Mohammad Ali Nur Romel Mar 14 '21 at 15:03
  • To get the best solution, please explain the objective of the query. Is `distance > 70` a constant filter? What else is variable? Do you need `SELECT *` or is a small subset of columns enough? Please share basic information as instructed here: https://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Mar 14 '21 at 15:08

5 Answers

3

The explanation behind the difference you observed is this:

Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.

In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.

The difference will become obvious in EXPLAIN output.
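For instance (a sketch; the constant stands in for whatever query 1 returns, like the example value quoted in the comments below):

-- Threshold computed in a subquery: unknown at plan time
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > (SELECT max(datetime)
                   FROM   locations
                   WHERE  distance > 70
                   AND    user_id = 9087);

-- Threshold supplied as a constant: Postgres can use column statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > '2021-03-14 14:40:23+00';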

To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feeding the result of query 1 to query 2 as a constant, as sketched below), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
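The two-query variant can even be scripted in psql (a minimal sketch; \gset stores the result of query 1 in a psql variable, which is then interpolated into query 2 as a constant):

-- Query 1: compute the threshold; \gset saves it as :datetime_threshold
SELECT max(datetime) AS datetime_threshold
FROM   locations
WHERE  distance > 70
AND    user_id = 9087 \gset

-- Query 2: the threshold is now a plain constant at plan time
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > :'datetime_threshold';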

For example, with dynamic SQL in a PL/pgSQL function:

CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
  RETURNS SETOF locations
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE 
     'SELECT *
      FROM   locations
      WHERE  user_id = $1
      AND    datetime > $2'
   USING _user_id
      , (SELECT max(datetime)
         FROM   locations
         WHERE  distance > _distance
         AND    user_id = _user_id);
END
$func$;

Call:

SELECT * FROM foo(9087);

In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.

As for "something useful in the docs":

[...] The important difference is that **EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values**; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

Bold emphasis mine.

Indexes

Perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST);  -- for query 1
CREATE INDEX ON locations (user_id, datetime);                                            -- for query 2

Fine-tuning depends on undisclosed details. A partial index might be an option.
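For instance, if distance > 70 is a constant filter (one of the comments above asks exactly that), a partial index along these lines might work (an untested sketch):

CREATE INDEX ON locations (user_id, datetime DESC NULLS LAST)
WHERE  distance > 70;  -- serves query 1 only if the filter value is fixed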

There may be any number of additional reasons why your query is slow. Not enough details.

Erwin Brandstetter
  • Thanks. So is there any idiomatic way to have `datetime_threshold` as a constant, to avoid the performance issues caused by the `select` statement in the `where` clause of the second part? Of course, I could handle this in the application layer above the DB level, but I got curious about this while being unable to find anything useful in the docs. – partizaans Mar 14 '21 at 15:42
  • I added an example solution. – Erwin Brandstetter Mar 14 '21 at 16:33
0

If you want your query to perform well, I would suggest adding the indexes locations(user_id, distance) and locations(user_id, datetime).
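In Postgres that would be something like (the index names are illustrative):

CREATE INDEX locations_user_id_distance_idx ON locations (user_id, distance);
CREATE INDEX locations_user_id_datetime_idx ON locations (user_id, datetime);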

I would also phrase the query using window functions:

select l.*
from (select l.*,
             max(datetime) filter (where distance > 70) over (partition by user_id) as datetime_threshold
      from locations l
      where user_id = 9087
     ) l
where datetime > datetime_threshold;

Window functions often improve performance. However, with the right indexes, I don't know if the two versions will be substantially different.

Gordon Linoff
  • That didn't help either. There is no problem with the indexes. The `with` clause runs in approx. 2 secs. And the second part takes 1.5 secs if the dynamic part [I mean the `(select datetime_threshold from cached_vars)` statement] is substituted with the result of the first part [something like '2021-03-14 14:40:23.000000 +00:00']. The combination of these parts breaks the performance (the query takes more than 2 minutes). – partizaans Mar 14 '21 at 14:50
0

Please break the query into two parts and store the result of the first part in a temp table (a temporary table in PostgreSQL is only accessible in the current database session). Then join the temp table with the second part. Hopefully it will speed up the processing time.

CREATE TEMPORARY TABLE temp_table_cached_vars (
    datetime_threshold timestamp
);

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
insert into temp_table_cached_vars
select datetime_threshold from cached_vars;
-- END PART 1

-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from temp_table_cached_vars limit 1);
-- END PART 2
0

Just add LIMIT 1 in the subquery, as in the example below.

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars Limit 1)
-- END PART 2
  • Thanks again, I tried your answer but the performance issue still exists. I think the problem is with the query itself, as @erwin-brandstetter mentioned in his answer. Thanks for your help. – partizaans Mar 15 '21 at 01:12
0

Try changing it to:

with cached_vars as not materialized (

See https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive for an explanation.
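Applied to the question's query (Postgres 12+; NOT MATERIALIZED lets the planner inline the CTE instead of materializing it), that looks like:

-- BEGIN PART 1
with cached_vars as not materialized (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2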