
I want to clean up some data in a table before putting in a unique constraint on two columns.

CREATE TABLE test (
 a integer NOT NULL,
 b integer NOT NULL,
 c integer NOT NULL,
 CONSTRAINT a_pk PRIMARY KEY (a)
);

INSERT INTO test (a,b,c) VALUES
 (1,2,3)
,(2,2,3)
,(3,4,3)
,(4,4,4)
,(5,4,5)
,(6,4,4)
,(7,4,4);

-- SELECT a FROM test WHERE ????

Output should be 2,6,7

I am looking for all rows after the first that have a duplicate (b,c) combination.

EX:

  • Rows 1 and 2 have (2,3) as (b,c). Row 1 is OK because it is the first; row 2 is not.

  • Rows 4, 6, and 7 have (4,4) as (b,c). Row 4 is OK because it is the first; rows 6 and 7 are not.

I will then:

DELETE FROM test WHERE a IN (those IDs);

... and add the unique constraint.
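
For reference, the constraint I plan to add afterwards would look something like this (the constraint name is just a placeholder):

ALTER TABLE test ADD CONSTRAINT test_b_c_uni UNIQUE (b, c);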

I was thinking about an intersect of test with itself, but I'm not sure where to go from there.


4 Answers


I ran a couple of tests. The EXISTS variant proves to be substantially faster - as I expected and contrary to what @Tometzky posted.

Test bed with 10.000 rows on PostgreSQL 9.1.2 with decent settings:

CREATE TEMP TABLE test (
  a serial
 ,b int NOT NULL
 ,c int NOT NULL
);

INSERT INTO test (b,c)
SELECT (random()* 100)::int AS b, (random()* 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);

Between the first and second round of tests, I ran:

ANALYZE test;

When I finally applied the DELETE, 3368 duplicates were deleted. Performance may vary if you have substantially more or fewer duplicates.

I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best hardly differs from the first or worst.
A bare SELECT (without the DELETE) shows similar results.
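
One way to repeat a destructive DELETE against identical data (an assumption about the test harness, shown here with variant 4) is to wrap each run in a transaction and roll it back:

BEGIN;
EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );
ROLLBACK;  -- undo the DELETE so the next run sees the same rows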

1. CTE with rank()

Total runtime: 150.411 ms
Total runtime: 149.853 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rk > 1;

2. CTE with row_number()

Total runtime: 148.240 ms
Total runtime: 147.711 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rn > 1;

3. row_number() in subquery

Total runtime: 134.753 ms
Total runtime: 134.298 ms -- after ANALYZE

DELETE FROM test
USING (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )  x
WHERE  x.a = test.a
AND    rn > 1;

4. EXISTS semi-join

Total runtime: 143.777 ms
Total runtime: 69.072 ms -- after ANALYZE

DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );

The difference in the second run comes from a switch to a Hash Semi Join instead of an additional Sort + Merge Semi Join.

Results

  • EXISTS clearly wins with up-to-date table statistics.
  • With outdated statistics, row_number() in a subquery is fastest.
  • rank() is the slowest variant.
  • CTE is slower than subquery.
  • ANALYZE (updated statistics) helps performance and can help a lot. Autovacuum (on by default) should more or less take care of this automatically - except for temporary tables or immediately after major changes to the table. A quick way to check is sketched below.
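
For regular tables, the statistics views show when a table was last analyzed, manually or by autovacuum (a quick sketch; note that autovacuum never processes temporary tables, so those always need a manual ANALYZE):

SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'test';  -- substitute the table in question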

Test with 100.000 rows

I repeated the test with 100.000 rows and 63045 duplicates. Similar results, except that EXISTS was slower, even after ANALYZE.

  1. Total runtime: 1648.601 ms
  2. Total runtime: 1623.759 ms
  3. Total runtime: 1568.893 ms
  4. Total runtime: 1692.249 ms

Raising the statistics target to 1000 and then to the maximum of 10000 (overkill in real life) and running another ANALYZE sped up all queries by ~ 1 %, but the query planner still went with Sort + Merge Semi Join for EXISTS.

ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;

Only after I forced the planner to avoid merge joins did it use a Hash Semi Join, taking half the time again:

SET enable_mergejoin = off
  4. Total runtime: 850.615 ms
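
enable_mergejoin is an ordinary session setting, so it is best restored right after the experiment, or scoped to a single transaction (standard commands, sketched here):

RESET enable_mergejoin;  -- restore the default after the experiment

-- or limit the override to one transaction:
BEGIN;
SET LOCAL enable_mergejoin = off;
-- run the DELETE / EXPLAIN ANALYZE here
ROLLBACK;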

Update

There have been improvements to the query planner since then. A retest with PostgreSQL 9.1.7 went straight to the Hash Semi Join.

– Erwin Brandstetter

  • This is strange. On my Core i3-2120 workstation (with Postgres 9.1.3 installed from the Fedora 16 64-bit package, untuned), your 10000-row test runs in 5-7 ms (according to EXPLAIN ANALYZE), and the EXISTS test is indeed 1 ms faster than my rank(). – Tometzky Jun 01 '12 at 08:25
  • @Tometzky: The hardware of my test server is like 6 years old. Seems it cannot compete with current hardware any more. – Erwin Brandstetter Jun 01 '12 at 14:34
  • @ErwinBrandstetter for the sake of a level playing field could you compare performance on your hardware of the method I posted? – David Aldridge Mar 13 '13 at 18:52
  • @DavidAldridge: I'd like to refer you to [this closely related answer on dba.SE](http://dba.stackexchange.com/questions/36539/how-do-i-remove-duplicate-records-in-a-join-table-in-psql/36614#36614), where I (just this minute) finished a performance test for exactly this case. It's the reason I updated this question in the first place. – Erwin Brandstetter Mar 13 '13 at 18:59

Using window functions should be much faster than the plain EXISTS answer below:

select a
from (
  select a, rank() over (partition by b, c order by a) as rank
  from test
  ) as _
where rank > 1;
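
To turn this into the cleanup the question asks for, the same query can feed a DELETE (a minimal sketch in the same style):

delete from test
where a in (
  select a
  from (
    select a, rank() over (partition by b, c order by a) as rank
    from test ) as _
  where rank > 1 );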
– Tometzky

select o.a from test o
where exists ( select 'x' 
                 from test i
                where i.c = o.c
                  and i.b = o.b
                  and i.a < o.a
            );

Thanks to a coworker!

– sjakubowski

I would give this a try:

delete from
  test
where
  a not in (
    select   min(a)
    from     test
    group by b,c)

Between 20 and 60 ms on my machine, for what it's worth, and ANALYZE does not affect the plan.

 Delete on test  (cost=237.50..412.50 rows=5000 width=6)
   ->  Seq Scan on test  (cost=237.50..412.50 rows=5000 width=6)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  HashAggregate  (cost=225.00..235.00 rows=1000 width=12)
                 ->  Seq Scan on test  (cost=0.00..150.00 rows=10000 width=12)
– David Aldridge
  • This looks like `EXPLAIN` output. Try the same with [`EXPLAIN ANALYZE`](http://www.postgresql.org/docs/current/interactive/sql-explain.html) to get actual times. – Erwin Brandstetter Mar 13 '13 at 19:11