2

My query needs to return all usage records whose pipeline rate is not 'No Usage'.

What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?

I have seen the above question and decided to use IN over EXISTS as the values in the tables are nullable. Which one of the following is better and more efficient or is there any other way which more efficient than the following two?

SELECT *
FROM   usagerecords UR
WHERE  UR.usagerateid NOT IN (SELECT id
                              FROM   pipelinerate PR
                              WHERE  PR.name = 'No Usage')

SELECT *
FROM   usagerecords UR
WHERE  UR.usagerateid IN (SELECT id
                          FROM   pipelinerate PR
                          WHERE  PR.name <> 'No Usage') 
Community
  • 1
  • 1
Ram
  • 3,092
  • 10
  • 40
  • 56

2 Answers2

4

NOT IN is going to give you the wrong results if id is nullable (which I hope it is not, otherwise it has a terrible name).

Why would you choose IN over EXISTS when it has been proven time and time again that EXISTS is more efficient (or at least no less efficient), since it can short-circuit? IN has to materialize the entire set.

SELECT * -- stop doing this
  FROM dbo.usagerecords AS UR
  WHERE EXISTS
  (
    SELECT 1 FROM dbo.pipelinerate AS pr
      WHERE pr.id = ur.usagerateid
      AND pr.name <> 'No Usage'
  );

You can also express your other query like this:

SELECT * -- again, stop doing this
  FROM dbo.usagerecords AS UR
  WHERE NOT EXISTS 
  (
    SELECT 1 FROM dbo.pipelinerate AS pr
      WHERE pr.id = ur.usagerateid
      AND pr.name = 'No Usage'
  );

But I have no idea which, if either, gets the correct results. This is why we typically ask for sample data and desired results.

Your use of SELECT * is likely to have a greater negative impact on performance than whether you use IN or EXISTS. FWIW.

Aaron Bertrand
  • 272,866
  • 37
  • 466
  • 490
  • This query is a part of the larger query and I am not using SELECT * I just mentioned it for simplification and id is not nullable. I have tired both `EXISTS` and `IN`. I am getting unnecessary records when using `EXISTS` but the right records when using `IN` – Ram Apr 18 '13 at 16:03
  • 2
    @Srikanth Show sample data and desired results. Otherwise we're just guessing. – Aaron Bertrand Apr 18 '13 at 16:04
  • While trying `EXISTS` I wrote `pr.id <> ur.usagerateid AND pr.name <> 'No Usage'` instead of `pr.id = ur.usagerateid AND pr.name <> 'No Usage'`. thanks it works and is faster. Which one would you recommend `NOT EXISTS` or `EXISTS`? – Ram Apr 18 '13 at 17:13
  • Well, that depends! Do they both give the right results? Did you test that? Do you know what results you expect? – Aaron Bertrand Apr 18 '13 at 17:34
  • Yes, Both gave right and expected results. Both take same time on average and execution plans are similar. – Ram Apr 18 '13 at 17:42
0

"What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?"

Not exists exits as soon as it sees one match.

Not in doesn't. People get concerned about in or not in lists that contain large numbers of results, but some testing by Jeff Moden shows that they work fine up to the million item range, which is usually sufficient.

Left join where is null is set based, so it's the "classic" solution. "In" is basically turned into a giant or list. Left join where is null doesn't have any specific advantages when you're just testing for something being missed, I like to use a left self join/is null pattern when I'm looking for say the most recent event per user.

Not in is super simple, and any novice developer will understand what it does.

Not exists is almost as clear, but maybe above novice.

Left join/is null is routinely misunderstood even by mid-level developers. So personally I find not in the most maintainable.

Brian White
  • 1,265
  • 1
  • 10
  • 16
  • regardless of Moden's testing, why code one way when you know your data is small, and a different way (but no harder to code) when you know your data is large? When your data becomes large, are you going to go change all of your code? What if you don't know if/when your data becomes large? Standard principles apply here: consistently use the most efficient approach. – Aaron Bertrand Apr 18 '13 at 16:05
  • Also, while this "question" was listed in the original post, it is just a reference link, and isn't the question the OP is actually asking. – Aaron Bertrand Apr 18 '13 at 16:06
  • The Jeff Moden tests may have been RDBMS specific. My unscientific observations are that "not in" is very slow with a much lower threshold using redbrick databases. – Dan Bracuk Apr 18 '13 at 16:10
  • "Which one of the following is better and more efficient or is there any other way which more efficient than the following two?" - Very subjective. I would define "better" as easier for mid-level and below developers to understand. Developer time costs more than database time in most cases, and bugs are expensive beyond belief. – Brian White Apr 19 '13 at 22:33
  • I mostly read Moden on sqlservercentral, so it's possible it was only referring to sql server. Writing a standard way is fine. Pre-optimizing should be avoided. The developers working on query optimizers do a lot to mitigate common silly things that devs do – Brian White Apr 19 '13 at 23:10