75

I have a 1:1 relationship between two tables. I want to find all the rows in table A that don't have a corresponding row in table B. I use this query:

SELECT id 
  FROM tableA 
 WHERE id NOT IN (SELECT id 
                    FROM tableB) 
ORDER BY id desc

id is the primary key in both tables. Apart from primary key indices, I also have a index on tableA(id desc).

Using H2 (Java embedded database), this results in a full table scan of tableB. I want to avoid a full table scan.

How can I rewrite this query to run quickly? What index should I should?

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
Steve McLeod
  • 51,737
  • 47
  • 128
  • 184
  • 1
    every time you write 'WHERE col [NOT] IN (SELECT col FROM othertable)' you are better off refactoring using [NOT] EXISTS. – topchef Sep 14 '09 at 02:28

6 Answers6

104
select tableA.id from tableA left outer join tableB on (tableA.id = tableB.id)
where tableB.id is null
order by tableA.id desc 

If your db knows how to do index intersections, this will only touch the primary key index

SquareCog
  • 19,421
  • 8
  • 49
  • 63
  • 13
    This is why I love Stack Overflow. Saturday, SQL problem - question answered precisely and successfully in 5 minutes! – Steve McLeod Sep 12 '09 at 16:16
  • 2
    you got some good suggestions in the other answers as well. Naturally I think mine will be fastest :-) but db implementations vary widely, and I've no experience with H2. It would be great if you benchmarked the different approaches and updated the question with your results. – SquareCog Sep 12 '09 at 17:06
35

You can also use exists, since sometimes it's faster than left join. You'd have to benchmark them to figure out which one you want to use.

select
    id
from
    tableA a
where
    not exists
    (select 1 from tableB b where b.id = a.id)

To show that exists can be more efficient than a left join, here's the execution plans of these queries in SQL Server 2008:

left join - total subtree cost: 1.09724:

left join

exists - total subtree cost: 1.07421:

exists

Eric
  • 92,005
  • 12
  • 114
  • 115
  • 1
    +1: The EXISTS condition is considered "to be met" if the subquery (correllated in this case) returns at least one row. – OMG Ponies Sep 12 '09 at 16:34
  • benchmarking is a good idea. I am wracking my brain trying to figure out what a db could be doing under the covers for exists+correlated subquery that would make it faster than an index-only hash join. Do you know? – SquareCog Sep 12 '09 at 17:20
  • 2
    `Exists` isn't using your standard correlated subquery. It uses a semi-join. The execution plan on SQL Server 2008 for `left join` is two index scans to a hash match to a filter to a select. For the `not exists`, it is two index scans to a hash match to a select--no filter. The `exists` hash match is actually slightly faster than the `left join`. `left join` has a total cost of 1.09, `not exists` of 1.07 on `DimCustomer` for `AdventureWorksDW` to `AdventureWorksDW2008`. – Eric Sep 12 '09 at 17:56
  • Nice!! Thanks. That's one smart optimizer. Granted the cost is approximate, but I buy it on the filter vs semijoin principle. – SquareCog Sep 13 '09 at 01:34
7

You have to check every ID in tableA against every ID in tableB. A fully featured RDBMS (such as Oracle) would be able to optimize that into an INDEX FULL FAST SCAN and not touch the table at all. I don't know whether H2's optimizer is as smart as that.

H2 does support the MINUS syntax so you should try this

select id from tableA
minus
select id from tableB
order by id desc

That may perform faster; it is certainly worth benchmarking.

APC
  • 144,005
  • 19
  • 170
  • 281
6

For my small dataset, Oracle gives almost all of these queries the exact same plan that uses the primary key indexes without touching the table. The exception is the MINUS version which manages to do fewer consistent gets despite the higher plan cost.

--Create Sample Data.
d r o p table tableA;
d r o p table tableB;

create table tableA as (
   select rownum-1 ID, chr(rownum-1+70) bb, chr(rownum-1+100) cc 
      from dual connect by rownum<=4
);

create table tableB as (
   select rownum ID, chr(rownum+70) data1, chr(rownum+100) cc from dual
   UNION ALL
   select rownum+2 ID, chr(rownum+70) data1, chr(rownum+100) cc 
      from dual connect by rownum<=3
);

a l t e r table tableA Add Primary Key (ID);
a l t e r table tableB Add Primary Key (ID);

--View Tables.
select * from tableA;
select * from tableB;

--Find all rows in tableA that don't have a corresponding row in tableB.

--Method 1.
SELECT id FROM tableA WHERE id NOT IN (SELECT id FROM tableB) ORDER BY id DESC;

--Method 2.
SELECT tableA.id FROM tableA LEFT JOIN tableB ON (tableA.id = tableB.id)
WHERE tableB.id IS NULL ORDER BY tableA.id DESC;

--Method 3.
SELECT id FROM tableA a WHERE NOT EXISTS (SELECT 1 FROM tableB b WHERE b.id = a.id) 
   ORDER BY id DESC;

--Method 4.
SELECT id FROM tableA
MINUS
SELECT id FROM tableB ORDER BY id DESC;
Leigh Riffel
  • 6,381
  • 3
  • 34
  • 47
3

I can't tell you which of these methods will be best on H2 (or even if all of them will work), but I did write an article detailing all of the (good) methods available in TSQL. You can give them a shot and see if any of them works for you:

http://code.msdn.microsoft.com/SQLExamples/Wiki/View.aspx?title=QueryBasedUponAbsenceOfData&referringTitle=Home

Aaron Alton
  • 22,728
  • 6
  • 34
  • 32
-1
select parentTable.id from parentTable
left outer join childTable on (parentTable.id = childTable.parentTableID) 
where childTable.id is null
Striezel
  • 3,693
  • 7
  • 23
  • 37