59

I have 2 tables (A and B) with the same primary keys. I want to select all row that are in A and not in B. The following works:

select * from A where not exists (select * from B where A.pk=B.pk);

however it seems quite bad (~2 sec on only 100k rows in A and 3-10k less in B)

Is there a better way to run this? Perhaps as a left join?

select * from A left join B on A.x=B.y where B.y is null;

On my data this seems to run slightly faster (~10%) but what about in general?

BCS
  • 75,627
  • 68
  • 187
  • 294
  • Do you mean you have 100k rows in table A and 100k rows in table B? Or just around 300 rows in each table, meaning 100,000 row scans (or, hopefully, index scans). – ChrisInEdmonton Dec 09 '08 at 20:00

5 Answers5

62

I think your last statement is the best way. You can also try

SELECT A.*    
from A left join B on 
    A.x = B.y
    where B.y is null
Smok
  • 246
  • 1
  • 7
Nick Berardi
  • 54,393
  • 15
  • 113
  • 135
  • 2
    That doesn't make any sense. When B.y is null, A.x=B.y will never be true. You'll get all rows of A, not just those that have no matching row in B. – Bill Karwin Dec 09 '08 at 20:46
  • 2
    @Bill and yet it works! Plus it is the exact same thing as the second statement listed above. – Nick Berardi Dec 11 '08 at 14:12
  • 1
    @hamstar, yes, now it's correct, by changing `and` to `where`. – Bill Karwin Oct 16 '12 at 17:23
  • 2
    @Smok "I think your last statement is the best way. You can also try..." ~ Is yours not identical? – Neil Hillman Feb 19 '14 at 14:06
  • @BillKarwin It makes a lot of sense. Left joins include every A record at least once--once for each matching B, or one with NULL in the B columns if there is no matching B. From http://www.w3schools.com/sql/sql_join_left.asp: "The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match." The A's that have a mathcing B will have a non-null B.y for every occurence, while the A's with no B have one occurence with a null B.y – Isaac Betesh Jun 20 '14 at 15:57
  • 1
    @IsaacBetesh, thanks, but you should be aware that I posted my first comment above before hamstar's edit to fix the query logic. Check the edit history to see the original answer, which was wrong. – Bill Karwin Jun 20 '14 at 16:13
37

I use queries in the format of your second example. A join is usually more scalable than a correlated subquery.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
2

Joins are generally faster (in MySQL), but you should also consider your indexing scheme if you find that it's still moving slowly. Generally, any field setup as a foreign key (using INNODB) will already have an index set. If you're using MYISAM, make sure that any columns in the ON statement are indexed, and consider also adding any columns in the WHERE clause to the end of the index, to make it a covering index. This allows the engine to have access to all the data needed in the index, removing the need to make a second round-trip back to the original data. Keep in mind that this will impact the speed of inserts/updates/deletes, but can significantly increase the speed of the query.

ChoNuff
  • 814
  • 6
  • 12
  • Actually, Raul, your edit makes it sound like you should create a separate index for the columns in the WHERE clause. For clarity, I meant to add the columns to the index (covering index), so the engine wouldn't have to make a round-trip back to the base data for the WHERE clause. Apologies for not being clear. – ChoNuff Aug 21 '14 at 04:31
2

I also use left joins with a "where table2.id is null" type criteria.

Certainly seems to be more efficient than the nested query option.

Dave Rix
  • 1,621
  • 1
  • 11
  • 17
-2

This helped me a lot. Joins are always faster than Sub Queries to give results:

SELECT tbl1.id FROM tbl1 t1
LEFT OUTER JOIN tbl2 t2 ON t1.id = t2.id 
WHERE t1.id>=100 AND t2.id IS NULL ;
Jamie Rees
  • 7,973
  • 2
  • 45
  • 83