I ran a couple of tests. The EXISTS variant proves to be substantially faster - as I expected and contrary to what @Tometzky posted.
Test bed with 10,000 rows on PostgreSQL 9.1.2 with decent settings:
CREATE TEMP TABLE test (
a serial
,b int NOT NULL
,c int NOT NULL
);
INSERT INTO test (b, c)
SELECT (random() * 100)::int AS b, (random() * 100)::int AS c
FROM   generate_series(1, 10000);
ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);
Between the first and second round of tests, I ran:
ANALYZE test;
When I finally applied the DELETE, 3368 duplicates were deleted. Performance may vary if you have substantially more or fewer duplicates.
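The number of rows such a DELETE will remove can be counted up front - a quick sanity check I'm adding here, not part of the original test:
-- rows to be deleted = total rows minus distinct (b,c) combinations
SELECT count(*) - count(DISTINCT (b, c)) AS duplicates
FROM   test;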
I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best hardly differs from the first or worst.
A bare SELECT (without the DELETE) shows similar results.
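Each timing can be taken along these lines - a minimal sketch I'm adding, since the exact procedure isn't shown above. EXPLAIN ANALYZE actually executes the DELETE, so rolling back keeps the test table intact between runs:
BEGIN;
EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
   SELECT 1
   FROM   test t1
   WHERE  t1.a < t.a
   AND   (t1.b, t1.c) = (t.b, t.c)
   );
ROLLBACK;  -- undo the DELETE so every run starts from the same data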
1. CTE with rank()
Total runtime: 150.411 ms
Total runtime: 149.853 ms -- after ANALYZE
WITH x AS (
SELECT a
,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
FROM test
)
DELETE FROM test
USING x
WHERE x.a = test.a
AND rk > 1;
2. CTE with row_number()
Total runtime: 148.240 ms
Total runtime: 147.711 ms -- after ANALYZE
WITH x AS (
SELECT a
,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
FROM test
)
DELETE FROM test
USING x
WHERE x.a = test.a
AND rn > 1;
3. row_number() in subquery
Total runtime: 134.753 ms
Total runtime: 134.298 ms -- after ANALYZE
DELETE FROM test
USING (
SELECT a
,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
FROM test
) x
WHERE x.a = test.a
AND rn > 1;
4. EXISTS semi-join
Total runtime: 143.777 ms
Total runtime: 69.072 ms -- after ANALYZE
DELETE FROM test t
WHERE EXISTS (
SELECT 1
FROM test t1
WHERE t1.a < t.a
AND (t1.b, t1.c) = (t.b, t.c)
);
The difference in the second run comes from a switch to a Hash Semi Join instead of an additional Sort + Merge Semi Join.
Results
- EXISTS clearly wins with up-to-date table statistics.
- With outdated statistics, row_number() in a subquery is fastest.
- rank() is the slowest variant.
- A CTE is slower than a subquery.
- ANALYZE (updated statistics) helps performance and can help a lot. Autovacuum (default) should more or less take care of this automatically - except for temporary tables, which autovacuum cannot process, or immediately after major changes to the table.
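If in doubt whether statistics are current, the last (auto)analyze times can be checked - a sketch I'm adding, shown for a regular table:
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'test';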
Test with 100,000 rows
I repeated the test with 100,000 rows and 63045 duplicates. Similar results, except that EXISTS was slower, even after ANALYZE. Runtimes, in the order of the queries above:
- Total runtime: 1648.601 ms
- Total runtime: 1623.759 ms
- Total runtime: 1568.893 ms
- Total runtime: 1692.249 ms
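For reference, a sketch of how the larger test bed could be rebuilt. The original doesn't state the random range used for this rerun, so the 200 below is only a guess chosen to land near the reported duplicate count:
TRUNCATE test;
INSERT INTO test (b, c)
SELECT (random() * 200)::int AS b, (random() * 200)::int AS c  -- range is an assumption
FROM   generate_series(1, 100000);
ANALYZE test;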
Raising the statistics target to 1000 and then to the maximum of 10000 (overkill in real life), followed by another ANALYZE, sped up all queries by ~ 1 %, but the query planner still went with Sort + Merge Semi Join for EXISTS.
ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;
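To verify that the new per-column targets took effect - a sketch I'm adding, querying the system catalog pg_attribute where attstattarget is stored:
SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'test'::regclass
AND    attname IN ('b', 'c');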
Only after I forced the planner to avoid merge joins did it use a Hash Semi Join, again taking half the time:
SET enable_mergejoin = off;
- Total runtime: 850.615 ms
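The setting is local to the session, but be sure to restore the default before running anything else:
RESET enable_mergejoin;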
Update
There have been improvements to the query planner since then. In a retest with PostgreSQL 9.1.7, the planner went straight to a Hash Semi Join.