
I need to delete the majority (say, 90%) of a very large table (say, 5m rows). The other 10% of this table is frequently read, but not written to.

From "Best way to delete millions of rows by ID", I gather that I should remove any index on the 90% I'm deleting, to speed up the process (except an index I'm using to select the rows for deletion).

From "PostgreSQL locking mode", I see that this operation will acquire a ROW EXCLUSIVE lock on the entire table. But since I'm only reading the other 10%, this ought not matter.

So, is it safe to delete everything in one command (i.e. DELETE FROM table WHERE delete_flag='t')? I'm worried that if the deletion of one row fails, triggering an enormous rollback, then it will affect my ability to read from the table. Would it be wiser to delete in batches?

Community
Simon Lepkin
  • A little advice first: whatever you do, back up your database first, then proceed with the deletion DELETE FROM table WHERE delete_flag='t' – dev35000 Feb 03 '16 at 22:27
  • 2
    Unlike in Oracle, `rollback` is actually quite cheap in Postgres –  Feb 03 '16 at 23:54
  • To add to @a_horse_with_no_name's comment: "actually cheap" means PostgreSQL won't even touch your table in the case of a rollback; it just marks the transaction as aborted. O(1) in big-O notation. – Piotr Findeisen Mar 15 '16 at 22:41

2 Answers

  1. Indexes are typically useless for operations on 90% of all rows. Sequential scans will be faster either way. (Exotic exceptions apply.)

  2. If you need to allow concurrent reads, you cannot take an exclusive lock on the table. So you also cannot drop any indexes in the same transaction.

  3. You could drop indexes in separate transactions to keep the duration of the exclusive lock at a minimum. In Postgres 9.2 or later you can also use DROP INDEX CONCURRENTLY, which only needs minimal locks. Later use CREATE INDEX CONCURRENTLY to rebuild the index in the background - and only take a very brief exclusive lock.
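Run in separate transactions, the drop/rebuild cycle could look like this (the index, table, and column names are placeholders):

```sql
-- Each statement must run outside a transaction block:
-- CONCURRENTLY cannot be used inside one.
DROP INDEX CONCURRENTLY some_idx;                       -- Postgres 9.2+
DELETE FROM tbl WHERE delete_flag = 't';                -- the big delete
CREATE INDEX CONCURRENTLY some_idx ON tbl (some_col);   -- rebuild in the background
```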

If you have a stable condition to identify the 10% (or less) of rows that stay, I would suggest a partial index on just those rows to get the best of both worlds:

  • Reading queries can access the table quickly (using the partial index) at all times.
  • The big DELETE is not going to modify the partial index at all, since none of the rows are involved in the DELETE.
CREATE INDEX foo ON tbl (some_id) WHERE delete_flag = FALSE;

Assuming delete_flag is boolean. You have to include the same predicate in your queries (even if it seems logically redundant) to make sure Postgres can use the partial index.
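For instance, a read query might repeat the predicate verbatim so the planner can match it against the partial index (`tbl` and `some_id` are placeholder names):

```sql
SELECT *
FROM   tbl
WHERE  some_id = 123
AND    delete_flag = FALSE;  -- same predicate as the partial index
```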

Erwin Brandstetter
  • I very much like the idea of adding a partial index on the 10%, speeding up reads (on the 10%) without hurting runtime of the deletion. But since you don't mention it in your answer, should I assume that deletion-in-batches is unnecessary for this task? – Simon Lepkin Feb 03 '16 at 22:54
  • 1
    @SimonLepkin: I don't see anything in the question that would require deletion in batches. Typically, errors half way into a delete are extremely unlikely. If you have a different situation (maybe triggers that might interfere?) then deleting in batches might make sense. There is also the question if intermediary states with only some rows deleted shall be visible to the rest of the world - if not, you need to do it in a single transaction. – Erwin Brandstetter Feb 03 '16 at 23:10

Delete in batches of a specific size and sleep between deletes (note: COMMIT inside a DO block requires PostgreSQL 11 or later):

create temp table t as
select id from tbl where ...;
create index on t(id);

do $$
declare
    sleep      int = 5;       -- seconds to pause between batches
    batch_size int = 10000;   -- rows to delete per batch
    c          refcursor;
    cur_id     int = 0;
    seq_id     int = 0;       -- rows fetched in the current batch
    del_id     int = 0;       -- total rows fetched so far
    ts         timestamp;
begin
    <<top>>
    loop
        raise notice 'sleep % sec', sleep;
        perform pg_sleep(sleep);
        raise notice 'continue..';
        open c for select id from t order by id;
        <<inn>>
        loop
            fetch from c into cur_id;
            seq_id = seq_id + 1;
            del_id = del_id + 1;
            if cur_id is null then
                -- cursor exhausted: delete the remaining ids and stop
                raise notice 'going to delete, total: %', del_id;
                ts = current_timestamp;
                close c;
                delete from tbl tb using t where tb.id = t.id;
                delete from t;
                commit;  -- transaction control in DO requires Postgres 11+
                raise notice 'ok: %', current_timestamp - ts;
                exit top;
            elsif seq_id >= batch_size then
                -- full batch: delete all ids up to and including cur_id
                raise notice 'going to delete: %', del_id;
                ts = current_timestamp;
                delete from tbl tb using t where t.id = tb.id and t.id <= cur_id;
                delete from t where id <= cur_id;
                close c;
                commit;  -- transaction control in DO requires Postgres 11+
                raise notice 'ok: %', current_timestamp - ts;
                seq_id = 0;
                exit inn;
            end if;
        end loop inn;
    end loop top;
end;
$$;
vividsnow