1

I am using the following query to get zero results on Amazon Redshift.

Query 1:

SELECT id
FROM flx2.groups
WHERE id NOT IN (SELECT groupid FROM flx2.lmsprovidergroups)

Now, the same query starts to give me results if I modify it slightly to:

Query 2:

SELECT id
FROM flx2.groups
WHERE id NOT IN (SELECT id
                 FROM flx2.groups
                 WHERE id IN (SELECT groupid FROM flx2.lmsprovidergroups))

I'm trying to exclude the id from lmsprovidergroups to get a subset from groups. Why would I have to include it first (as in the inner query in query 2) and then exclude it again?

As far as I'm concerned, query 1 is absolutely the same as query 2.
Why does query 2 work while query 1 does not?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Patthebug
  • 4,647
  • 11
  • 50
  • 91
  • It would be easier with a `LEFT JOIN`. And the two queries are totally different. – Jakub Kania Feb 16 '16 at 23:34
  • @JakubKania: Would it be possible for you to throw some light on how these queries are different? I know I came up with them, but I fail to see how they're different. I mean, sure, one yields results, the other one doesn't, they have to be different. But how, why? – Patthebug Feb 16 '16 at 23:43
  • Any `nulls` in `flx2.groups.id`? Btw: the `distinct` in a subselect for an `IN` or `NOT IN` clause is useless –  Feb 16 '16 at 23:44
  • @a_horse_with_no_name: I agree `distinct` is useless in this case. No `nulls` in `flx2.groups.id`. – Patthebug Feb 16 '16 at 23:45
  • 2
    Any nulls in `flx2.lmsprovidergroups.groupid`? –  Feb 16 '16 at 23:54
  • Yes, there are `nulls` in `flx2.lmsprovidergroups.groupid`. – Patthebug Feb 17 '16 at 16:33
  • I agree with the left join: SELECT g.id FROM flx2.groups g LEFT JOIN flx2.lmsprovidergroups pg ON pg.groupid = g.id WHERE pg.groupid IS NULL Try to check performance just in case – borjab Feb 23 '16 at 18:46

2 Answers2

3

If the sub-query for a NOT IN condition returns at least one NULL value the NOT IN evaluates to "unknown" which results in no rows being matched at all.

You need to eliminate the null values from the sub-query:

SELECT id
FROM flx2.groups
WHERE id NOT IN (SELECT groupid 
                 FROM flx2.lmsprovidergroups
                 WHERE groupid is not null)

Your second query essentially does the same thing: it eliminates the null values from the sub-query (assuming groups.id does not contain any nulls)

1

NOT IN (SELECT ...) is almost always a bad choice. Not only does it exhibit "surprising" behavior with NULL values on either side (if you are not not familiar with the logic), it's also typically slower than the superior alternative with NOT EXISTS:

SELECT id
FROM   flx2.groups g
WHERE  NOT EXISTS (SELECT 1 FROM flx2.lmsprovidergroups
                   WHERE  groupid = g.id);

There are other standard techniques:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • You opened my eyes to something I had no clue about. Thanks a lot. I have been using `NOT IN` in pretty much every query that I write before I ran into this problem. Now I know better and thanks to you :). – Patthebug Feb 25 '16 at 19:43