Mystery query fail: Why did this create a massive output?

Question

I was attempting to do some basic Venn Diagram subtraction to compare a temp table to some live data, and see how they were different.

This query blew up to well north of 15 million returned rows, and I noticed it was duplicating (by 10,000x or more) a known unique field - indicating something went very wrong with my query (I mean by this that rows were being duplicated and I could verify this by this Globally Unique Identifier field). I was expecting to get at most 200 rows returned:

select a.*
    from TableOfLiveData a
        inner join #TempDataToBeSubtracted b
        on a.GUID <> b.guidTemp           --I suspect the issue is here   
    where {here was a limiting condition that should have reduced my live
           data to a "pre-join" count(*) of 20,000 at most...}

After I hit Execute the query ran much longer than expected and I could see that millions of rows were being returned before I had to cancel out.

Let me know what the obvious thing is!?!?

edit: FYI: If the where clause were not included, I would expect a VAST amount of rows returned...

I mean, you are most likely performing almost a cartesian product between both tables. Think about it, take one value of `GUID` from the first table, and you get a result for every row of the second table where `guidTemp` is not equal — Lamak, Aug 03 '16 at 21:02
@Lamak, The same thought occurred to me ... but it was an inner join?! — ColinMac, Aug 03 '16 at 21:04
An inner join is just a cross join with the join predicate evaluated on all rows. Take the cross join evaluate `a.GUID <> b.guidTemp` on all rows and keep the ones where it is true. — Martin Smith, Aug 03 '16 at 21:05
It is an `INNER JOIN`, where the condition to return a row is that the value of those `GUID`s are different. As I said, every row where those values are different is gonna be returned. It's the same as a `CROSS JOIN` with a condition — Lamak, Aug 03 '16 at 21:05
Should not the `JOIN` be a `NOT IN` subquery in the `WHERE` part instead? — Alejandro, Aug 03 '16 at 21:06
@Alejandro, I think I had it right according to my intention, which was to subtract everything in #TempData from the live table. Of course, the subsequent where clause was very important to trimming down the live table... — ColinMac, Aug 03 '16 at 21:13
@MartinSmith and Lamak, Sounds like you're both making the same point. May I ask why the rows were being duplicated? I understand that it should return all cases where the live data GUID was not in my #TempTableGuidList — ColinMac, Aug 03 '16 at 21:16
@ColinMac What you are doing is almost certainly not what you want. I don't know how else to explain it other than just take **one** value for `GUID` and compare it with every single value of `guidTemp`. Every time that you find that those values are different, then that row will fulfill your join condition so it will be in the result set. Now keep doing that for every single value of `GUID`, and then you'll know why you are getting so many rows as a result — Lamak, Aug 03 '16 at 21:20
This may or may not help http://stackoverflow.com/a/27458534/73226 — Martin Smith, Aug 03 '16 at 21:24
@MartinSmith, Regarding your first comment. I reran the query with Left Join in lieu of Inner Join and still had the Cartesian product issue? Obviously I don't understand Joins like I thought I did. I'll go study up and as a takeaway, will avoid using 'not equals' in the "on" clause of a join... — ColinMac, Aug 03 '16 at 21:46
a.GUID <> b.guidTemp --I suspect the issue is here I suspect it is there, too. You are asking for everything that doesn't match. Can you put in your where condition? or maybe you can use not in (subquery) to fix it. — Missy, Aug 03 '16 at 21:50
Please show the actual condition, or at least tell us the table(s) to which it applies — Bohemian, Aug 03 '16 at 23:19
@colin ok, but which tables are those columns in. It's important to know because it makes a big difference to the final query — Bohemian, Aug 04 '16 at 16:59
@bohemian `where TableofLiveData.field1 = 3 and TableofLiveData.field2 = 1515` This restricted the size of the live data by a factor of probably 1/10,000 — ColinMac, Aug 04 '16 at 18:04

score 1 · Answer 1 · answered Aug 03 '16 at 21:54

Thank you @Lamak and @MartinSmith for your comments that solved this problem.

By using a 'not equals' in my "on" clause, I ensured that I would be selecting every row in LiveTable that didn't have a GUID in my #TempTable, not just once as I intended, but for each entry in my #TempTable, multiplying my results by about 20,000 in this case (the cardinality of the #TempTable).

To fix this, I did a simple subquery on my #TempTable using the "Not In" Statement as recommended in the comments. This query finished in under a minute and returned under a 100 rows, which was much more in-line with my expectation:

select a.*
    from TableOfLiveData a
    where a.GUID not in (select b.guidTemp from #TempDataToBeSubtracted b)
    and {subsequent constraint statement not relevant to question}

score 1 · Accepted Answer · answered Aug 05 '16 at 00:30

Although your query is logically correct, the problem is you have a "Cartesian product" (n x m rows) in your join, but the where clause is executed after the join is made, so you have a colossal number of rows over which the where clause must be executed... so it will be very, very slow.

A better approach is to do an outer join on the key columns, but discard all successful joins by filtering for missed joins:

select a.*
from TableOfLiveData a
left join #TempDataToBeSubtracted b on b.guidTemp = a.GUID
where a.field1 = 3
and a.field2 = 1515
and b.guidTemp is null -- only returns rows that *don't* match

This works because when an outer join is missed, you still get the row from the main table and all columns in the joined table are null.

Creating an index on (field1, field2) will improve performance.

I'm curious, if I told you that the "Join on" GUIDs only accounted for 1/10000 of the total of TableOfLiveData, would this script be slow because it basically grabs ALL of TableOfLiveData BEFORE filtering on Field1, Field2, GUID is Null? — ColinMac, Aug 05 '16 at 19:25
@ColinMac You're right that it will first try to join all rows from TableOfLiveData, then throw out hits. But that process will be very quick. 15M is not a large number of rows, and the optimizer should be able to throw stuff out early. As long as there's an index on `#TempDataToBeSubtracted(guidTemp)` it should be very fast. — Bohemian, Aug 05 '16 at 20:30

Mystery query fail: Why did this create a massive output?

2 Answers2