0

I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).

This is Postgres 9.5.

The CTE version is this:

with targets as
    (
        select
            id,
            row_number() over(partition by uuid order by created_date desc) as rn
        from
            customer
    )
delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

I killed that version this morning after running for over an hour.

The temp table version is this:

create temp table
    targets
as select
    id,
    row_number() over(partition by uuid order by created_date desc) as rn
from
    customer;

delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

This version finishes in about 7 seconds.

Any idea what may be causing this?

John Chrysostom
  • 3,973
  • 1
  • 34
  • 50
  • Because a CTE is executed unconditionally. It must be. In your case it will produce all the tuples, even those with rn=1. As an alternative to your temp table you could rewrite your CTE as a subquery. (which probably will be the fastest, too) – wildplasser Mar 02 '16 at 15:29

2 Answers2

4

The CTE is slower because it has to be executed unaltered (via a CTE scan).

TFM (section 7.8.2) states: Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output. Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output.

It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.

The CTE-solution can be refactored into a joined subquery, though (similar to the temp table in the question). In postgres, a joined subquery is usually faster than the EXISTS() variant, nowadays.

DELETE FROM customer del
USING ( SELECT id
        , row_number() over(partition by uuid order by created_date desc)
                 as rn
        FROM customer
        ) sub
WHERE sub.id = del.id
AND sub.rn > 1
        ;

Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.

CREATE TEMP VIEW targets
AS SELECT id
        , row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;

EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
            FROM targets
            WHERE rn > 1
        );

[UPDATED: I was wrong about the CTEs need to be always-executed-to-completion, which is only the case for data-modifying CTEs]

wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • What does it mean for a CTE to execute unconditionally? Googling "CTE execute unconditionally" returns very few relevant results... – John Chrysostom Mar 02 '16 at 19:13
  • TFM says: http://www.postgresql.org/docs/9.5/static/queries-with.html 7.8.2 `Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output.` – wildplasser Mar 02 '16 at 19:22
  • No need to get snarky. I had assumed "unconditionally" was some sort of database term. Incidentally, the statement you quote there is unrelated to my situation (since I'm not using a data-modifying statement inside my CTE), but the manual also says `A useful property of WITH queries is that they are evaluated only once per execution of the parent query... However, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a WITH query than an ordinary subquery.` – John Chrysostom Mar 02 '16 at 20:48
  • You are right, That looks like a more relevant part of the documentation. The conclusion that the CTE acts as an optimisation barrier still stands. (this seems to be unnecessary from a logical standpoint (since there are no side effects) hoisting the row_number() *could* be allowed. But it isn't. (a possible other complicating factor *could* be rewrite rules) – wildplasser Mar 02 '16 at 21:10
1

Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTE, but it is likely in memory, so if your server is memory starved and the resultset of your CTE is very large then you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.

An alternative way to doing that delete which might be faster than both of your methods:

DELETE C
FROM
    Customer C
WHERE
    EXISTS (SELECT * FROM Customer C2 WHERE C2.uuid = C.uuid AND C2.created_date > C.created_date)

That won't handle situations where you have exact matches with created_date, but that can be solved by adding the id to the subquery as well.

Tom H
  • 46,766
  • 14
  • 87
  • 128
  • You are wrong about CTE implementation. Your EXISTS() query looks good, apart from the flipped `<' . – wildplasser Mar 02 '16 at 15:42
  • I noticed the flipped comparison. But really one of the aliases is wrong if you want to fix that instead. – shawnt00 Mar 02 '16 at 15:42
  • Thanks, I've corrected the alias. I did a brief search of how PostgeSQL implements CTEs, but couldn't find anything other than the documentation which says to "think of them as temporary tables", but that doesn't mean that they are necessarily implemented like that. – Tom H Mar 02 '16 at 15:45
  • Just from a syntax perspective I like to use "SELECT NULL" instead of "SELECT *" all you are doing is to detect if the row exists, not if the content of the row exists. It becomes: DELETE C FROM Customer C WHERE EXISTS (SELECT NULL FROM Customer C2 WHERE C2.uuid = C.uuid AND C2.created_date > C.created_date) – Niels Andersen Nov 28 '21 at 20:14