0

I want to have a variable cached during a query performing on Postgres 12. I followed the approach of having CTE like below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

Running the above query leads to performance issues. I expected the total runtime to be approximately (part 1 runtime + part 2 runtime), but it takes a lot longer.

Notably, there is no performance issue when I run only the 2nd part with a manually supplied datetime_threshold.

The locations table is defined as:

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------

Is there any way to reduce the total runtime to something like (part1 runtime + part2 runtime)?

partizaans
  • I don't understand your question. Your two "performance estimates" are the same equation. – Gordon Linoff Mar 14 '21 at 14:40
  • @partizaans have you considered using a temporary table to hold the date from the first query? You can give it a try. Please look at my answer. – Kazi Mohammad Ali Nur Romel Mar 14 '21 at 15:03
  • To get the best solution, please explain the objective of the query. Is `distance > 70` a constant filter? What else is variable? Do you need `SELECT *` or is a small subset of columns enough? Please share basic information as instructed here: https://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Mar 14 '21 at 15:08

5 Answers

3

The explanation behind the difference you observed is this:

Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.

In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.

The difference will become obvious in EXPLAIN output.
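For instance (a sketch; the constant stands in for whatever query 1 returns, like the example value quoted in the comments below):

-- Threshold computed in a subquery: unknown at plan time
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > (SELECT max(datetime)
                   FROM   locations
                   WHERE  distance > 70
                   AND    user_id = 9087);

-- Threshold supplied as a constant: Postgres can use column statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > '2021-03-14 14:40:23+00';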

To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feeding the result of query 1 to query 2 as a constant, as sketched below), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
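The two-query variant can even be scripted in psql (a minimal sketch; \gset stores the result of query 1 in a psql variable, which is then interpolated into query 2 as a constant):

-- Query 1: compute the threshold; \gset saves it as :datetime_threshold
SELECT max(datetime) AS datetime_threshold
FROM   locations
WHERE  distance > 70
AND    user_id = 9087 \gset

-- Query 2: the threshold is now a plain constant at plan time
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > :'datetime_threshold';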

For example, with dynamic SQL in a PL/pgSQL function:

CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
  RETURNS SETOF locations
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE 
     'SELECT *
      FROM   locations
      WHERE  user_id = $1
      AND    datetime > $2'
   USING _user_id
      , (SELECT max(datetime)
         FROM   locations
         WHERE  distance > _distance
         AND    user_id = _user_id);
END
$func$;

Call:

SELECT * FROM foo(9087);

In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.

As for "something useful in the docs":

[...] The important difference is that **EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values**; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

Bold emphasis mine.

Indexes

Perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST);  -- for query 1
CREATE INDEX ON locations (user_id, datetime);                                            -- for query 2

Fine-tuning depends on undisclosed details. A partial index might be an option.
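For instance, if distance > 70 is a constant filter (one of the comments above asks exactly that), a partial index along these lines might work (an untested sketch):

CREATE INDEX ON locations (user_id, datetime DESC NULLS LAST)
WHERE  distance > 70;  -- serves query 1 only if the filter value is fixed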

There may be any number of additional reasons why your query is slow. Not enough details.

Erwin Brandstetter
  • Thanks. So is there any idiomatic way to have `datetime_threshold` as a constant, to avoid the performance issues caused by the `select` statement in the `where` clause of the second part? Of course, I could handle this in the application layer above the DB level, but I got curious about this while being unable to find anything useful in the docs. – partizaans Mar 14 '21 at 15:42
  • I added an example solution. – Erwin Brandstetter Mar 14 '21 at 16:33
0

If you want your query to perform well, I would suggest adding the indexes locations(user_id, distance) and locations(user_id, datetime).
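In Postgres that would be something like (the index names are illustrative):

CREATE INDEX locations_user_id_distance_idx ON locations (user_id, distance);
CREATE INDEX locations_user_id_datetime_idx ON locations (user_id, datetime);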

I would also phrase the query using window functions:

select l.*
from (select l.*,
             max(datetime) filter (where distance > 70) over (partition by user_id) as datetime_threshold
      from locations l
      where user_id = 9087
     ) l
where datetime > datetime_threshold;

Window functions often improve performance. However, with the right indexes, I don't know if the two versions will be substantially different.

Gordon Linoff
  • That didn't help either. There is no problem with the indexes. The `with` clause runs in approx. 2 secs. And the second part takes 1.5 secs if the dynamic part [I mean the `(select datetime_threshold from cached_vars)` statement] is substituted with the result of the first part [something like '2021-03-14 14:40:23.000000 +00:00']. The combination of these parts breaks the performance (the query takes more than 2 minutes). – partizaans Mar 14 '21 at 14:50
0

Please break the query into two parts and store the result of the first part in a temp table (a temporary table in PostgreSQL is only accessible in the current database session). Then join the temp table with the second part. Hopefully it will speed up the processing time.

CREATE TEMPORARY TABLE temp_table_cached_vars (
    datetime_threshold timestamp
);

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
insert into temp_table_cached_vars
select datetime_threshold from cached_vars;
-- END PART 1

-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from temp_table_cached_vars limit 1);
-- END PART 2
0

Just add LIMIT 1 in the subquery, as in the example below.

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars Limit 1)
-- END PART 2
  • Thanks again, I tried your answer but the performance issue still exists. I think the problem is with the query itself, as @erwin-brandstetter mentioned in his answer. Thanks for your help. – partizaans Mar 15 '21 at 01:12
0

Try changing it to:

with cached_vars as not materialized (

See https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive for an explanation.
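Applied to the question's query (Postgres 12+; NOT MATERIALIZED lets the planner inline the CTE instead of materializing it), that looks like:

-- BEGIN PART 1
with cached_vars as not materialized (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2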