0

So I'm testing 2 queries and I'm getting different results. I want to correct/patch up my understanding. Here's two generic SQL queries that to my understanding are the same but when executed get different results. Note this is not a question about diff between ANSI and non-ANSI SQL.

Query 1 (using LEFT JOIN):

SELECT * FROM person p LEFT JOIN person_log pl
ON p.person_id = pl.person_id
WHERE pl.person_id IS NULL
AND p.is_active = 1;

Query 2 (using 2 queries):

SELECT * FROM person
WHERE person.is_active = 1
AND person_id NOT IN (SELECT person_id FROM person_log);

To my understanding, both represent this in venn diagram form. Also, is one more efficient than the other? A query on JOIN results vs 2 queries?

EDIT: Changed = to IS in query 1. Thanks to @Justin Samuel for spotting the = error that's causing different results!

Community
  • 1
  • 1
Sticky
  • 3,671
  • 5
  • 34
  • 58
  • 2
    Without optimizer stepping in, and on lees advanced DBs, #1 should be faster and scaleable. Optimiser should convert #2 to #1 – Bohemian Sep 21 '16 at 19:14
  • Joins can't be correctly represented with venn diagrams, is just a simplified way to explain it visually – Lamak Sep 21 '16 at 19:15
  • 1
    @Bohemian This isn't always true. If there are `NULL`s in the `person_id` column of the `person_log` table, then the results will be completely different – Lamak Sep 21 '16 at 19:16
  • Agreed they generally are the same thing however as Lamak points wout if person_id coult be null in person_log then you will NOT get your desired results. NOT EXISTS could be faster than both of these methods. Plus this kind of is a duplicate of: http://stackoverflow.com/questions/173041/not-in-vs-not-exists – Matt Sep 21 '16 at 19:21
  • 1
    @Matt Your link is for SQL Server, and if I remember correctly in MySQL the `NOT EXISTS` solution is generally slower than the `LEFT JOIN` one – Lamak Sep 21 '16 at 19:23
  • @Bohemian I can see why but the funny thing is, I'm using Oracle SQL Developer to execute the two queries and query 1 (50~ sec) takes a lot longer than query 2 (1~ sec). – Sticky Sep 21 '16 at 20:35
  • @lamak I'm made the reasonable assumption that a table called `person_log` that has a column called `person_id` that is a foreign key to the `person` table has no nulls. It there are any nulls (a personless person log entry?), you get no results, which is hardly a subtle difference. Further, if there are nulls, you could add the trivial condition `where person_id is not null` to fix it. – Bohemian Sep 21 '16 at 23:41

4 Answers4

2

There is one bug in the above query 1. You cannot use "=" to check whether it is NULL

SELECT * FROM person p LEFT JOIN person_log pl
ON p.person_id = pl.person_id
WHERE pl.person_id = NULL
AND p.is_active = 1;

Ideally you should be using IS NULL

SELECT * FROM person p LEFT JOIN person_log pl
ON p.person_id = pl.person_id
WHERE pl.person_id IS NULL
AND p.is_active = 1;

You can review the NULL checks in the https://www.simple-talk.com/sql/t-sql-programming/how-to-get-nulls-horribly-wrong-in-sql-server/

Justin Samuel
  • 1,063
  • 4
  • 16
  • 30
  • Oh cool! Because of that change, now both queries return the same and the above answers make more sense. I wish I can have two answers. Maybe some admin can merge your answer with the accepted answer. (if they ever notice this) – Sticky Sep 21 '16 at 20:29
  • 1
    @Sticky - Nice to hear that your issue is resolved. As far as the comparison between the two approaches go it varies based on the situation and specifics like indexing etc etc. So I didnt comment on that part. – Justin Samuel Sep 21 '16 at 20:38
  • Oops, I missed the part where Sticky told us that the queries don't give the same result. (Well, a question like "why isn't the first query returning any rows" might have been more obvious.) Well spotted. – Thorsten Kettner Sep 21 '16 at 22:33
1

Both queries get you the same data.

The second query is the straight-forward way to the problem; get all persons that have no entry in person_log. You can do the same with a NOT EXISTS clause instead of a NOT IN clause. (NOT IN is a bit leaner, but the values you select in the subquery must not be null, for otherwise you see no data at all. I usually perfer IN / NOT IN over EXISTS / NOT EXISTS for their simplicity, but that's a matter of personal preference.

The first query is called an anti join. It is a trick to achieve the same as a NOT EXISTS or NOT IN query on weak database systems that don't implement these methods well. (The reason is that when a new database system is written, the programmers usually put all their effort in joins for they are so important and neglect EXISTS and IN for some time.)

It depends on the DBMS which gets executed fastest, NOT IN, NOT EXISTS or the anti join. The ideal DBMS would get to the same execution plan, no matter which syntax you choose.

The anti join can produce large intermediate results. With a mature DBMS you shouldn't use anti joins for this reason and for mere readability.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
0

If you are looking for rows that don't match in the second table, especially if columns are not nullable then use NOT IN.

Ross Bush
  • 14,648
  • 2
  • 32
  • 55
  • This is wrong, if the columns are nullable, then the `NOT IN` shouldn't be used at all – Lamak Sep 21 '16 at 19:26
  • The optimizer (which only gets better over time) apparently pegs them as identical according to this frequently seen reference [Here](https://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/). So I see no difference. Obviously queries and schemas vary. But one needs to be scientific about it, not guess. – Drew Sep 21 '16 at 19:34
  • I did not think that optimization could determine if a query was for only non existing values. I guess things are getting smarter. – Ross Bush Sep 21 '16 at 19:48
0

The first one gets all the persons that have no log entries and then filters out the inactive ones.

The second query gets all the persons. Then filters out the inactive. Then gets all the log entries. Then filters out the persons that have no log entries.

They will both return the same info. However, for performance reasons, queries like the second one should be avoided if using a JOIN is possible. The main benefit of the JOINs is indexes. Only one index in the WHERE will be used, but each JOIN will use one.

CptMisery
  • 612
  • 4
  • 15