15

I need to return all values from colA that are not in colB from mytable. I am using:

SELECT DISTINCT(colA) FROM mytable WHERE colA NOT IN (SELECT colB FROM mytable)

It is working however the query is taking an excessively long time to complete.

Is there a more efficient way to do this?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Flash
  • 15,945
  • 13
  • 70
  • 98
  • `NOT IN` slows down as the set size grows, and there's often a limit to how many rows can be in the `NOT IN` clause. Outside of small result sets I've found it better to use other means to get the difference between two result sets. – Paul Jan 05 '12 at 03:16
  • When talking about performance, you must name your RDBMS or get suboptimal answers. – Erwin Brandstetter Jan 05 '12 at 06:18

3 Answers3

27

In standard SQL there are no parentheses in DISTINCT colA. DISTINCT is not a function.

SELECT DISTINCT colA
FROM   mytable
WHERE  colA NOT IN (SELECT DISTINCT colB FROM mytable);

Added DISTINCT to the sub-select as well. If you have many duplicates it could speed up the query.

A CTE might be faster, depending on your DBMS. I additionally demonstrate LEFT JOIN as alternative to exclude the values in valB, and an alternative way to get distinct values with GROUP BY:

WITH x AS (SELECT colB FROM mytable GROUP BY colB)
SELECT m.colA
FROM   mytable m
LEFT   JOIN x ON x.colB = m.colA
WHERE  x.colB IS NULL
GROUP  BY m.colA;

Or, simplified further, and with a plain subquery (probably fastest):

SELECT DISTINCT m.colA
FROM   mytable m
LEFT   JOIN mytable x ON x.colB = m.colA
WHERE  x.colB IS NULL;

There are basically 4 techniques to exclude rows with keys present in another (or the same) table:

The deciding factor for speed will be indexes. You need to have indexes on colA and colB for this query to be fast.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks, I have tried: `SELECT DISTINCT m1.colA FROM mytable m1 LEFT JOIN mytable m2 ON (m1.colA = m2.colB) WHERE m2.colA IS NULL ORDER BY m1.colA ASC` and it is several orders of magnitude faster and seems to be working - is this equivalent to the code I posted in the question? It's so much quicker that I'm suspicious I might have missed something. – Flash Jan 05 '12 at 05:55
  • @Andrew: Sorry, you got a typo in your query. Must be `WHERE m2.colB IS NULL`. The (corrected) query might be faster with `LEFT JOIN (SELECT DISTINCT colB FROM mytable) m2 ON m2.colB = m1.colA` **if** there are many duplicate values for `colB`. – Erwin Brandstetter Jan 05 '12 at 06:00
  • @Andrew: `m2.colA` is always `NULL` if `m2.colB` is NULL here, but `m2.colA` can be NULL even if `m2.colB` is not. So the correct (and faster!) form here is: `WHERE m2.colB IS NULL`. **If** `colA` is defined NOT NULL, then your above query is correct. – Erwin Brandstetter Jan 05 '12 at 06:06
  • Thanks, the queries are giving me the same number of results I suppose because there are no `NULL` values. Do you know why it is so much faster? – Flash Jan 05 '12 at 06:23
  • @Andrew: Different query plans perform differently. I can't say much without knowing which RDBMS you are using. You might want to edit that information into your question. Information about indexes would be instrumental, too. – Erwin Brandstetter Jan 05 '12 at 06:26
  • Ok, many thanks. I made the `DISTINCT` change you suggested and it now completes in a few seconds. – Flash Jan 05 '12 at 06:29
6

You can use exists:

select distinct
    colA
from
    mytable m1
where
    not exists (select 1 from mytable m2 where m2.colB = m1.colA)

exists does a semi-join to quickly match the values. not in completes the entire result set and then does an or on it. exists is typically faster for values in tables.

Eric
  • 92,005
  • 12
  • 114
  • 115
  • @Andrew - Sure! It says, grab the distinct `colA`s where there's no row from `mytable` that `colB` is equal to that `colA`. – Eric Jan 05 '12 at 04:44
0

You can use the EXCEPT operator which effectively diffs two SELECT queries. EXCEPT DISTINCT will return only unique values. Oracle's MINUS operator is equivalent to EXCEPT DISTINCT.

Paul
  • 19,704
  • 14
  • 78
  • 96