9

I have these two tables and want to delete all authors from ms_author, who are not present in author.

author (1.6M rows)

+-------+-------------+------+-----+-------+
| Field | Type        | Null | Key | index |
+-------+-------------+------+-----+-------+
| id    | text        | NO   | PRI | true  |
| name  | text        | YES  |     |       |
+-------+-------------+------+-----+-------+

ms_author (120M rows)

+-------+-------------+------+-----+-------+
| Field | Type        | Null | Key | index |
+-------+-------------+------+-----+-------+
| id    | text        | NO   | PRI |       |
| name  | text        | YES  |     | true  |
+-------+-------------+------+-----+-------+

This is my query:

    DELETE
FROM ms_author AS m
WHERE m.name NOT IN
                   (SELECT a.name
                    FROM author AS a);

I tried to estimate the query duration: ~ 130 hours.
Is there a faster way to achieve this?

EDIT:

EXPLAIN VERBOSE output

Delete on public.ms_author m  (cost=0.00..2906498718724.75 rows=59946100 width=6)"
  ->  Seq Scan on public.ms_author m  (cost=0.00..2906498718724.75 rows=59946100 width=6)"
        Output: m.ctid"
        Filter: (NOT (SubPlan 1))"
        SubPlan 1"
          ->  Materialize  (cost=0.00..44334.43 rows=1660295 width=15)"
                Output: a.name"
                ->  Seq Scan on public.author a  (cost=0.00..27925.95 rows=1660295 width=15)"
                      Output: a.name"

Indexing author(name):

create index author_name on author(name);

Indexing ms_author(name):

create index ms_author_name on ms_author(name);
Sebbas
  • 399
  • 4
  • 12

2 Answers2

15

I'm a big fan of the "anti-join." This works efficiently for both large and small datasets:

delete from ms_author ma
where not exists (
  select null
  from author a
  where ma.name = a.name
)
Hambone
  • 15,600
  • 8
  • 46
  • 69
  • 2
    That's the way to go. `NOT IN (SELECT ...)` is a tricky clause. Typically, there are [better alternatives](http://stackoverflow.com/a/19364694/939860). – Erwin Brandstetter Dec 14 '15 at 19:12
  • Thanks! :) It took about 10 hours. Coming from around 130 hours is a huge improvement! ;) – Sebbas Dec 16 '15 at 14:33
2

Your delete query using NOT IN usually result in a nested loop antijoin which will result in poor performance. You can rewrite your query as follows:

You can write something like this:

DELETE FROM ms_author AS m
WHERE m.id IN
               (SELECT m.id FROM ms_author AS m
                LEFT JOIN author AS a ON m.name = a.name
                WHERE a.name IS NULL);

This approach has as additional advantage that you are using the primary key 'id' to delete rows and this should be much faster.

Alex
  • 21,273
  • 10
  • 61
  • 73