Very slow subqueries when using "NOT IN"

Question

I'm working on generating reports for data contained within a large pre-existing Access database (~500 mb after compact & repair), and I'm having trouble with a slow subquery.

The database has a big table which contains a record of every customer purchase. Here's a simple query which finds customers who have bought a blue widget. It completes within a few seconds and returns about ten thousand records.

SELECT DISTINCT CustomerId 
FROM ProductSales
WHERE Product = 'BLUE'

Here's a query which tries to find customers who have bought a blue widget, but not a red widget. It takes about an hour to run.

SELECT DISTINCT CustomerId FROM ProductSales
WHERE Product = 'BLUE' 
AND CustomerId NOT IN (
    SELECT CustomerId 
    FROM ProductSales 
    WHERE Product = 'RED'
)

Is there a way to refactor the second query to make it take a few minutes instead of an hour?

I presume the CustomerId field has indexes on it in both tables? — Marc B, Aug 09 '11 at 17:04
Have you tried SELECT DISTINCT CustomerId FROM ProductSales WHERE Product = 'BLUE' minus SELECT CustomerId FROM ProductSales WHERE Product = 'RED'. I've seen cases where it really speeded up the query, but YMMV — JB Nizet, Aug 09 '11 at 17:08
@Marc B: There's only one table here, but CustomerId is indexed on it. — James, Aug 09 '11 at 18:15

HansUp · Accepted Answer · 2014-02-06T02:12:41.217

13

Access' database engine can't use an index for Not In, so it's bound to be slow. With an index on CustomerId, this query should be much faster because the db engine can use the index.

SELECT DISTINCT blue.CustomerId
FROM
    ProductSales AS blue
    LEFT JOIN
        (
            SELECT CustomerId 
            FROM ProductSales 
            WHERE Product = 'RED'
        ) AS red
    ON blue.CustomerId = red.CustomerId
WHERE
        blue.Product = 'BLUE'
    AND red.CustomerId Is Null;

You could probably also try a Not Exists approach, but index use there is not guaranteed. Also, please see the comment below from David Fenton which discusses performance impact in more detail.

edited Feb 06 '14 at 02:12

answered Aug 09 '11 at 17:33

HansUp

95,961
11
77
135

1

Woah. Down from 60 minutes to about a second. That's like 3600 times faster than the old version. Thanks! :-) For future reference, is there a list of SQL operations Access doesn't use the index for somewhere? – James Aug 09 '11 at 18:30
1

There probably is a list somewhere, but I don't know where. :-) If you want to go hardcore on this, Google Jet ShowPlan ... that will tell you authoritatively how/if the db engine is using indexes with your query. You can find detailed discussion of query performance here: http://msdn.microsoft.com/en-us/library/aa188211(office.10).aspx – HansUp Aug 09 '11 at 18:45
1

It's not that NOT IN and NOT EXISTS never use the indexes -- it's that you can't predict when they will and when they won't. Any time you can re-engineer a NOT IN/EXISTS subquery into a JOIN, you'll likely improve performance. However, it all depends on editability of the resulting recordset -- some subqueries used in a JOIN will make the query non-updatable. – David-W-Fenton Aug 10 '11 at 21:14
I've been waiting for years for Microsoft to sunset Access and JET, but they don't, so I have to keep digging up old gems like this one. The left join is simple, remembering this is the faster method is not. – Kevin Welsh Oct 21 '20 at 13:22

score 0 · Answer 2 · answered Aug 09 '11 at 17:06

Add an index, of course, if you don't have one. If that's a problem, it's probably just that there are lots of customers with orders for something other than RED but not so many with BLUE; this (untested) query tries to fix that.

SELECT DISTINCT CustomerId FROM ProductSales
LEFT JOIN (
  SELECT DISTINCT CustomerId cid FROM ProductSales
  LEFT JOIN (
    SELECT DISTINCT CustomerId
    FROM ProductSales
    WHERE Product = 'BLUE'
  ) foo ON CustomerId = cid
  WHERE Product = 'RED'
) bar USING (CustomerId)
WHERE cid IS NULL

Very slow subqueries when using "NOT IN"

2 Answers2

Linked