
I have a query, as part of my process, that I execute hundreds of times in a loop.

Initially, Table A contains all records (20 mil). Table B contains 0 records.

The primary key in both tables is ID.

The query I execute is:

select * from A where A.ID not in (select ID from B) limit 10000
## magic stuff in Python
## insert the processed records into table B

Initially the query runs super fast, but after the Nth loop (100th+), table B has grown to the point where the NOT IN check takes noticeably longer.

Does anyone have recommendations on how I can speed up the query?

So far, I've tweaked the default MySQL buffer up to 1.5 GB (the IDs are pretty small INTs, so that should be enough).
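If the buffer in question is InnoDB's buffer pool (the post doesn't name the variable, so this is an assumption), that tweak would look something like:

-- assumption: the buffer being tuned is InnoDB's buffer pool, sized to 1.5 GB (value in bytes)
set global innodb_buffer_pool_size = 1610612736;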

Caveats:

1) One way to do this would be to delete records from table A after I've processed them. However, I want to keep table A intact.

The only method I could think of is adding another column to table A (which I'd index) called PROCESSED, and then updating that column with a second query once the records have been processed/posted. But I was hoping there was an easier solution.
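A rough sketch of that PROCESSED-column idea (the flag type and index name here are illustrative, not from the original post):

-- add an indexed flag column to A (names/types are illustrative)
alter table A add column PROCESSED tinyint not null default 0;
create index idx_a_processed on A (PROCESSED);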

  • Note: the first 1M records were processed in this fashion in 5 mins, but it took 10 mins to process the second million. Going to guess that this will keep increasing exponentially? – FlyingZebra1 Aug 30 '19 at 22:39
  • Since this question has been closed without a proper answer, I'll leave the solution here. Even though the "not in" query ran on fields that were indexed, the query was still slow. I ended up updating my table to include a 'processed' flag (0/1), which I also indexed, and tacked an update statement onto the end of each loop that marked the records as 1 (processed). Even though this was a bit of work upfront, the query now runs as fast during the 100th loop as it did during the first. Best of luck! – FlyingZebra1 Oct 13 '19 at 22:48
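A minimal sketch of the per-loop statements that solution describes, assuming the PROCESSED column sketched above (the batch size and the literal ID list are placeholders):

-- pull the next unprocessed batch
select * from A where PROCESSED = 0 limit 10000;
-- ## magic stuff in Python, insert the batch into table B ...
-- then mark the batch as processed (placeholder IDs stand in for the batch just handled)
update A set PROCESSED = 1 where ID in (1, 2, 3);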

2 Answers


Subquery vs Join

select <column list>
from A
    left join B on A.id = B.id
where B.id is null
limit 10000

And wrap it into an insert:

INSERT INTO B (<column list from B>)
select <column list from A>
from A
    left join B on A.id = B.id
where B.id is null
limit 10000

This should do it

  • From everything I'm seeing on the internets, the join is pretty much the same as the subquery, performance-wise, no? – FlyingZebra1 Aug 30 '19 at 23:01
  • @FlyingZebra1 It depends. If you join related tables, a join is best. I feel that a join in this case will be better than `NOT IN`. You might want to create a temporary relationship between the `ID` columns. There are a lot of variables, but one thing is guaranteed: a subquery will [practically] never be faster than a join. A lot also depends on the DB engine, the optimizer, etc. BTW, the validator gave this query a 100% score https://www.eversql.com/sql-syntax-check-validator/ – T.S. Aug 30 '19 at 23:21
  • So I did modify the query to run with a join, and it looks like the performance is pretty much the same (within 5%). Basically nothing crazy. I DID find, though, that pulling in 100k records per loop with this query takes pretty much the same amount of time as 10k records, so I will be increasing the 'records to process per loop' by 10-20x and see how she does. – FlyingZebra1 Aug 31 '19 at 00:38
  • Working like a champ at a 30x increase on the LIMIT section, pretty much the same processing time per loop :) – FlyingZebra1 Aug 31 '19 at 00:42
  • So, if you can process more records in the same time, you get the boost, no? And now, set the reference from `b.id` to `a.id` as a foreign key; it should give even more of a performance boost. All these performance items are theoretical work. Performance tuning is a whole separate subject. – T.S. Aug 31 '19 at 00:44
  • Yup, basically the bottleneck in the script was running the query I mentioned for just 10k records at a time. I increased the LIMIT section of the query from 10k to 1M, and doing 1M records at a time seems to take pretty much the same time to pull out, so now instead of doing 2k iterations of the query, I'll just do 20 iterations :), in A LOT less time. I haven't done much with keys, so I will check out what a foreign key does, but it doesn't look like I have much need for it now. – FlyingZebra1 Aug 31 '19 at 00:53
  • If you have a key and a join, I expect a solid performance boost – T.S. Aug 31 '19 at 00:55

Perhaps create an index on ID? If table B is initially empty, then you're just taking the unique IDs from table A. You can do that in a single insert query. Make sure MySQL is tuned for performance on large tables as well.

  • ID is the primary key in both tables, so that would mean it's already indexed? – FlyingZebra1 Aug 30 '19 at 22:53
  • Yes, an index should be automatically created. Since `ID` is unique in A, `insert into B select * from A` would be all you need (creating the index after the bulk insert would actually be fastest). – IronMan Aug 30 '19 at 22:56
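A rough sketch of the ordering that comment suggests, assuming the target table can be built without its primary key first (`B_new` is a hypothetical name to avoid clashing with the existing B; whether this is actually faster depends on the setup):

-- hypothetical one-shot load: build the table with no index, add the primary key afterwards
create table B_new as select * from A;
alter table B_new add primary key (ID);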