
I want to clean up some data in a table before putting in a unique constraint on two columns.

CREATE TABLE test (
 a integer NOT NULL,
 b integer NOT NULL,
 c integer NOT NULL,
 CONSTRAINT a_pk PRIMARY KEY (a)
);

INSERT INTO test (a,b,c) VALUES
 (1,2,3)
,(2,2,3)
,(3,4,3)
,(4,4,4)
,(5,4,5)
,(6,4,4)
,(7,4,4);

-- SELECT a FROM test WHERE ????

Output should be 2,6,7

I am looking for all rows after the first that have a duplicate (b,c) combination.

EX:

  • Rows 1 and 2 have (2,3) as (b,c). Row 1 is OK because it is the first; row 2 is not.

  • Rows 4, 6, and 7 have (4,4) as (b,c). Row 4 is OK because it is the first; rows 6 and 7 are not.

I will then:

DELETE FROM test WHERE a IN (those IDs);

... and add the unique constraint.
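
For reference, the constraint I plan to add afterwards would look something like this (the constraint name is just a placeholder):

ALTER TABLE test ADD CONSTRAINT test_b_c_uni UNIQUE (b, c);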

I was thinking about an intersect of test with itself, but I'm not sure where to go from there.


4 Answers


I ran a couple of tests. The EXISTS variant proves to be substantially faster - as I expected and contrary to what @Tometzky posted.

Test bed with 10.000 rows on PostgreSQL 9.1.2 with decent settings:

CREATE TEMP TABLE test (
  a serial
 ,b int NOT NULL
 ,c int NOT NULL
);

INSERT INTO test (b,c)
SELECT (random()* 100)::int AS b, (random()* 100)::int AS c
FROM   generate_series(1, 10000);

ALTER TABLE test ADD CONSTRAINT a_pk PRIMARY KEY (a);

Between the first and second round of tests, I ran:

ANALYZE test;

When I finally applied the DELETE, 3368 duplicates were deleted. Performance may vary if you have substantially more or fewer duplicates.

I ran each query a couple of times with EXPLAIN ANALYZE and took the best result. Generally, the best hardly differs from the first or worst.
A bare SELECT (without the DELETE) shows similar results.
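
One way to repeat a destructive DELETE against identical data (an assumption about the test harness, shown here with variant 4) is to wrap each run in a transaction and roll it back:

BEGIN;
EXPLAIN ANALYZE
DELETE FROM test t
WHERE  EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );
ROLLBACK;  -- undo the DELETE so the next run sees the same rows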

1. CTE with rank()

Total runtime: 150.411 ms
Total runtime: 149.853 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,rank() OVER (PARTITION BY b, c ORDER BY a) AS rk
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rk > 1;

2. CTE with row_number()

Total runtime: 148.240 ms
Total runtime: 147.711 ms -- after ANALYZE

WITH x AS (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )
DELETE FROM test
USING  x
WHERE  x.a = test.a
AND    rn > 1;

3. row_number() in subquery

Total runtime: 134.753 ms
Total runtime: 134.298 ms -- after ANALYZE

DELETE FROM test
USING (
    SELECT a
          ,row_number() OVER (PARTITION BY b, c ORDER BY a) AS rn
    FROM   test
    )  x
WHERE  x.a = test.a
AND    rn > 1;

4. EXISTS semi-join

Total runtime: 143.777 ms
Total runtime: 69.072 ms -- after ANALYZE

DELETE FROM test t
WHERE EXISTS (
    SELECT 1
    FROM   test t1
    WHERE  t1.a < t.a
    AND   (t1.b, t1.c) = (t.b, t.c)
    );

The difference in the second run comes from a switch to a Hash Semi Join instead of an additional Sort + Merge Semi Join.

Results

  • EXISTS clearly wins with up-to-date table statistics.
  • With outdated statistics, row_number() in a subquery is fastest.
  • rank() is the slowest variant.
  • CTE is slower than subquery.
  • ANALYZE (updated statistics) helps performance and can help a lot. Autovacuum (on by default) should more or less take care of this automatically - except for temporary tables or immediately after major changes to the table. A quick way to check is sketched below.
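
For regular tables, the statistics views show when a table was last analyzed, manually or by autovacuum (a quick sketch; note that autovacuum never processes temporary tables, so those always need a manual ANALYZE):

SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'test';  -- substitute the table in question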

Test with 100.000 rows

I repeated the test with 100.000 rows and 63045 duplicates. Similar results, except that EXISTS was slower, even after ANALYZE.

  1. Total runtime: 1648.601 ms
  2. Total runtime: 1623.759 ms
  3. Total runtime: 1568.893 ms
  4. Total runtime: 1692.249 ms

Raising the statistics target to 1000 and then to the maximum of 10000 (overkill in real life) and running another ANALYZE sped up all queries by ~ 1 %, but the query planner still went with Sort + Merge Semi Join for EXISTS.

ALTER TABLE test ALTER COLUMN b SET STATISTICS 10000;
ALTER TABLE test ALTER COLUMN c SET STATISTICS 10000;
ANALYZE test;

Only after I forced the planner to avoid merge joins did it use a Hash Semi Join, taking half the time again:

SET enable_mergejoin = off
  4. Total runtime: 850.615 ms
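
enable_mergejoin is an ordinary session setting, so it is best restored right after the experiment, or scoped to a single transaction (standard commands, sketched here):

RESET enable_mergejoin;  -- restore the default after the experiment

-- or limit the override to one transaction:
BEGIN;
SET LOCAL enable_mergejoin = off;
-- run the DELETE / EXPLAIN ANALYZE here
ROLLBACK;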

Update

There have been improvements to the query planner since then. A retest with PostgreSQL 9.1.7 went straight to the Hash Semi Join.

– Erwin Brandstetter

  • This is strange. On my Core i3-2120 workstation (with Postgres 9.1.3 installed from the Fedora 16 64-bit package, untuned), your 10000-row test runs in 5-7 ms (according to EXPLAIN ANALYZE), and the EXISTS test is indeed 1 ms faster than my rank(). – Tometzky Jun 01 '12 at 08:25
  • @Tometzky: The hardware of my test server is like 6 years old. Seems it cannot compete with current hardware any more. – Erwin Brandstetter Jun 01 '12 at 14:34
  • @ErwinBrandstetter for the sake of a level playing field could you compare performance on your hardware of the method I posted? – David Aldridge Mar 13 '13 at 18:52
  • @DavidAldridge: I'd like to refer you to [this closely related answer on dba.SE](http://dba.stackexchange.com/questions/36539/how-do-i-remove-duplicate-records-in-a-join-table-in-psql/36614#36614), where I (just this minute) finished a performance test for exactly this case. It's the reason I updated this question in the first place. – Erwin Brandstetter Mar 13 '13 at 18:59

Using window functions should be much faster than the plain EXISTS answer below:

select a
from (
  select a, rank() over (partition by b, c order by a) as rank
  from test
  ) as _
where rank > 1;
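
To turn this into the cleanup the question asks for, the same query can feed a DELETE (a minimal sketch in the same style):

delete from test
where a in (
  select a
  from (
    select a, rank() over (partition by b, c order by a) as rank
    from test ) as _
  where rank > 1 );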
– Tometzky

select o.a from test o
where exists ( select 'x' 
                 from test i
                where i.c = o.c
                  and i.b = o.b
                  and i.a < o.a
            );

Thanks to a coworker!

– sjakubowski

I would give this a try:

delete from
  test
where
  a not in (
    select   min(a)
    from     test
    group by b,c)

Between 20 and 60 ms on my machine, for what it's worth, and ANALYZE does not affect the plan.

 Delete on test  (cost=237.50..412.50 rows=5000 width=6)
   ->  Seq Scan on test  (cost=237.50..412.50 rows=5000 width=6)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  HashAggregate  (cost=225.00..235.00 rows=1000 width=12)
                 ->  Seq Scan on test  (cost=0.00..150.00 rows=10000 width=12)
– David Aldridge
  • This looks like `EXPLAIN` output. Try the same with [`EXPLAIN ANALYZE`](http://www.postgresql.org/docs/current/interactive/sql-explain.html) to get actual times. – Erwin Brandstetter Mar 13 '13 at 19:11