
Note: I do not have access to the source code/database to which this question pertains. The two tables in question are located on different servers.

I'm working with a 3rd party company that has systems integrated with our own. They have a query that runs something like this:

DELETE FROM table WHERE column NOT IN(1,2,3,4,5,.....3 000 000)

It's pretty much referencing around 3 million values in the NOT IN.

I'm trying to point out that this seems like an inefficient method for deleting multiple rows while keeping all the ones noted in the query. The problem is that, as I don't have access to the source code/database myself, I'm not totally sure what to suggest as a solution.

I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.

With this limited knowledge, what possible suggestions could I present to them?

The first thing that comes to mind is having some kind of flag column that indicates whether a row has been deleted or not. When the sync script runs, it would first perform an update on the target server for all rows marked as deleted (or an insert for new rows), then a second query to delete all rows marked for deletion.
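
As a rough sketch of what I mean (table and column names here are made up for illustration, and this assumes the source can report which rows were deleted since the last run):

-- Hypothetical flag-based sync; target_table, is_deleted and the id list are placeholders.
UPDATE target_table
SET is_deleted = 1
WHERE id IN (101, 102, 103);   -- ids reported as deleted by the source since the last sync

DELETE FROM target_table
WHERE is_deleted = 1;          -- second pass removes the flagged rows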

Is there a more logical way to do something like this, bearing in mind that complete overhauls of the functionality are out of the question? Only small tweaks to the current process will be possible, for a number of reasons.

Novocaine

3 Answers


Instead of

DELETE FROM your_table 
WHERE column NOT IN(1,2,3,4,5,.....3 000 000)

you could do

DELETE t1
FROM your_table t1
LEFT JOIN table_where_the_ids_come_from t2 ON t1.column = t2.id
WHERE t2.id IS NULL
juergen d
  • these tables are on different servers (same data centre), would this still work? – Novocaine Nov 16 '16 at 10:21
  • Maybe if you change your table. If it is the same server but different databases then it is easy: http://stackoverflow.com/questions/5698378/mysql-join-between-tables-in-2-different-databases. But different servers requires a structure change: http://stackoverflow.com/questions/11114197/join-tables-from-two-different-server – juergen d Nov 16 '16 at 10:27
  • 1
    I presume that the list is already passed across (to be used in the NOT IN clause). If needs be you could push these rows into a table (possibly a temp table) and then LEFT OUTER JOIN against that. – Kickstart Nov 16 '16 at 12:41
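
Building on that comment, a minimal sketch of the temp-table variant (all names are placeholders; this assumes the list of ids can be loaded onto the target server, ideally in batches rather than one giant INSERT):

CREATE TEMPORARY TABLE keep_ids (id INT PRIMARY KEY);
INSERT INTO keep_ids (id) VALUES (1), (2), (3);   -- load the full id list here, in batches if necessary

DELETE t1
FROM your_table t1
LEFT JOIN keep_ids t2 ON t1.your_column = t2.id
WHERE t2.id IS NULL;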

I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.

I know this is obvious, but why don't these two servers stay in sync using replication? I'm guessing it's because aside from this one table, they don't have identical data.

If out-of-the-box replication isn't flexible enough, you could use a change-data capture tool.

The idea is that the tool monitors changes in a MySQL binary log stream, and reacts to them. The reaction is user-defined, and it can include applying the same change to another MySQL instance, which would keep them in sync.
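
Binlog-based CDC tools generally need binary logging enabled in row-based format on the source; a quick sanity check (plain MySQL statements, shown here only as an illustration):

SHOW VARIABLES LIKE 'log_bin';         -- binary logging must be enabled (ON)
SHOW VARIABLES LIKE 'binlog_format';   -- should report ROW for row-level change capture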

Here's a blog post that shows how to use Maxwell, one of the open-source CDC tools, this one released by Zendesk: https://www.percona.com/blog/2016/09/13/mysql-cdc-streaming-binary-logs-and-asynchronous-triggers/

A couple of advantages of this approach:

  • No need to re-sync the whole table. You'd only apply incremental changes as they occur.
  • No need to schedule re-syncs daily or whatever. Since incremental changes are likely to be small, you could apply the changes nearly immediately.
Bill Karwin
  • You are correct. The two databases do not have identical data by design. CDC does sound like a nice solution, however the cost (time + billable hours) of implementing it is probably too high considering it would have to be the 3rd party that would be required to do this. I'd certainly look at doing this on some other internal projects though. – Novocaine Nov 17 '16 at 09:43
  • I sympathize. I used to do consulting, but I got frustrated with client after client telling me, "we can't change code, we can't change schema, we can't upgrade MySQL, we can't upgrade hardware, we can't even allow more than 10 seconds of downtime, ever. But we want you to wave a magic wand and fix our totally broken architecture anyway. But we don't want to spend a lot of money." – Bill Karwin Nov 17 '16 at 15:49

Deleting a large number of rows will take a huge amount of time. It is likely to require a full table scan. As it finds rows to delete, it will stress the undo/redo log. It will clog replication (if you are using it). And so on.

How many rows do you expect to delete?

Better would be to break the list up into chunks of 1000. (This applies whether you use IN (list of constants) or a JOIN.) But since you are doing NOT, it gets stickier; a chunked sketch follows the code below. Possibly the best way is to copy over what you want to keep:

CREATE TABLE `new` LIKE `real`;              -- empty copy with the same structure
INSERT INTO `new`
    SELECT * FROM `real` WHERE id IN (...);  -- without NOT: copy only the rows to keep
RENAME TABLE `real` TO `old`,                -- atomic swap of the two tables
             `new` TO `real`;
DROP TABLE `old`;
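
For the chunking mentioned above: since a NOT IN list cannot simply be split (a row qualifies only if it is absent from the whole list), one way to keep each delete transaction small is to cap the rows per statement and re-run. A rough sketch, with placeholder names (MySQL allows LIMIT on a single-table DELETE):

DELETE FROM your_table
WHERE your_column NOT IN (1,2,3,4,5 /* , ... */)
LIMIT 1000;    -- repeat until ROW_COUNT() reports 0 rows deleted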

I go into details of chunking, partitioning, and other techniques in Big Deletes.

Rick James
  • He isn't deleting 3 million rows. He's deleting rows that don't match any of 3 million *values*. It might end up deleting zero rows, especially if no rows have been deleted in the source instance since the last re-sync. – Bill Karwin Nov 17 '16 at 00:06