NOT IN statement is slowing down my query

Question

I have a problem with my query. I have a simple example here that illustrates the code I have.

SELECT distinct ID 
FROM Table  
WHERE IteamNumber in (132,434,675) AND Year(DateCreated) = 2019
      AND ID NOT IN (
                     SELECT Distinct ID FROM Table  
                     WHERE IteamNumber in (132,434,675) AND DateCreated < '2019-01-01')

As you can see, I'm retrieving unique data id's that has been created in 2019 and not earlier.

The select statements works fine, but once I use the NOT IN statement, the query could easily go 1 minute plus.

My other question could this be related to the computer/server performance that is running the SQL Server for Microsoft Business Central? Because the same query worked perfectly after all even with the (NOT IN) statement, but that was in Microsoft dynamics C5 SQL Server.

So my question is there something wrong with my query or is it mainly a server issue?

UPDATE: here is a real example: this takes 25 seconds to retrieve 500 rows

Select count(distinct b.No_),'2014'
from [Line] c    
inner join [Header] a
on a.CollectionNo = c.CollectionNo
Inner join [Customer] b
on b.No_ = a.CustomerNo

where  c.No_ in('2101','2102','2103','2104','2105')
and year(Enrollmentdate)= 2014 
and(a.Resignationdate < '1754-01-01 00:00:00.000' OR a.Resignationdate >= '2014-12-31')


and NOT EXISTS(Select distinct x.No_
                 from [Line] c    
                 inner join [Header] a
                 on a.CollectionNo = c.CollectionNo
                 Inner join [Customer] x
                 on x.No_ = a.CustomerNo
                 where x.No_ = b.No_ and 
                       c.No_ in('2101','2102','2103','2104','2105')
                       and Enrollmentdate < '2014-01-01'
                       and(a.Resignationdate < '1754-01-01 00:00:00.000' OR a.Resignationdate > '2014-12-31'))

There is actually no way to answer that question without knowing more about your database e.g. how many rows, what indexes etc. To get actual performance help you really need to show us your execution plan. — Dale K, Aug 09 '19 at 07:36
Pay attention to this topic: [NOT IN vs NOT EXISTS](https://stackoverflow.com/questions/173041/not-in-vs-not-exists) — Denis Rubashkin, Aug 09 '19 at 07:36
In your first query, you are including only those IDs that have datecreated in year 2019. It will automatically not include any of the ID that have 'DateCreated < '2019-01-01'' So why do you need this subquery which is doing the same thing for you. — Ritika, Aug 09 '19 at 07:37
Replace `Year(DateCreated) = 2019` with `DateCreated >= '2019-01-01'` and ditch the `NOT IN`. — Salman A, Aug 09 '19 at 07:45
@Ritika, if ID is not unique, this is incorrect: `It will automatically not include any of the ID that have 'DateCreated < '2019-01-01` — Denis Rubashkin, Aug 09 '19 at 07:48

Salman A · Answer 1 · 2019-08-09T11:50:29.760

3

If I understand correctly you can write the query as a GROUP BY query with a HAVING clause:

SELECT ID 
FROM t
WHERE IteamNumber in (132, 434, 675)
GROUP BY ID
HAVING MIN(DateCreated) >= '20190101' -- no row earlier than 2019
AND    MIN(DateCreated) <  '20200101' -- at least one row less than 2020

This will remove rows for which an earlier record exists. You can further improve the performance by creating a covering index:

CREATE INDEX IX_t_0001 ON t (ID) INCLUDE (IteamNumber, DateCreated)

edited Aug 09 '19 at 11:50

answered Aug 09 '19 at 07:54

Salman A

262,204
82
430
521

Just the removal of the second scan on the OP's table could be of great benefit. You might need to add a `MAX` clause to the `HAVING` as well, to ensure that rows for 2020 aren't captured (even if just a form of future proofing) or so that the OP can see how to apply it to different years. – Thom A Aug 09 '19 at 08:12
Hi Salman, I just tried your soultion but i get no data retrieved back. The reason i use NOT in, because the customerID could have cancelled their subscription 3 years ago and then they could choosen to subscribe again in 2019 with ofc the same accountnumber which is id(in this example) – pancake Aug 11 '19 at 08:24
ALso isn't it a bad idea to use an index or force an index to a stored procedure and not let there sql server choose it? – pancake Aug 11 '19 at 09:07
@MishMish this query should generate results identical to your query. If it does not give you the desired results you could be using a different where clause in your production code. Also, I have merely suggested and index, if SQL Server finds it useful it will use it. There is no forcing. – Salman A Aug 11 '19 at 15:00
Hi salman, i added the original query above, not sure if it would work with a having statement – pancake Aug 12 '19 at 13:42

DarkRob · Answer 2 · 2019-08-09T08:05:43.710

0

Issue is because of your IN statement, it is preferred in my opinion to avoid any IN statement rather then this, create join with subquery and filter out your data using where clause.

In case of IN statement each record of your table mapped with all the records of subquery, which definitely slows down your process.

If it is mandatory to use IN clause then use it with index. Create proper index of your respected columns, which improve your performance.

Instead of IN you may use EXISTS to increase the performance of your query.

Example of EXISTS is :

SELECT distinct ID 
FROM Table AS T 
WHERE IteamNumber in (132,434,675) AND Year(DateCreated) = 2019
      AND NOT EXISTS (
                     SELECT Distinct ID FROM Table AS T2 
                     WHERE T1.ID=T2.ID 
                     AND IteamNumber in (132,434,675) AND DateCreated < '2019-01-01' )

edited Aug 09 '19 at 08:05

answered Aug 09 '19 at 07:34

DarkRob

3,843
1
10
27

3

A well designed database should be able to handle an `in` statement though. Like for any database performance, we need to know the bigger picture to provide useful advice. – Dale K Aug 09 '19 at 07:37
Thank you darkrob, could you please demonstrate in code with my example how you would do it with join or exists? – pancake Aug 09 '19 at 07:42
I find it weird that it can't handle the Not in statement, it's not that many records we're talking about and basically Dynamics business central we're talking about here – pancake Aug 09 '19 at 07:46
*" it is preferred in my opinion to avoid any IN statement rather then this"* is a VERY sweeping statement. Are you saying that instead of `WHERE T.ID IN (1,2)` you would do `JOIN (VALUES(1),(2)) V(I) ON T.ID = V.I`? `IN` can be just as performant as a `JOIN` in many circumstances, and `IN` and `JOIN` are not synonyms; a `JOIN` will return all relevant rows possibly creating duplicates, where as an `IN` or `EXISTS` (which i tend to use) doesn't. They are *can* be functionally similar, but they aren't the same. – Thom A Aug 09 '19 at 08:07

Marc Guillot · Accepted Answer · 2019-08-19T08:33:05.157

I usually prefer JOINs than INs, you can get the same result but the engine tends be able to optimize it better.

You join your main query (T1) with what was the IN subquery (T2), and you filter that T2.ID is null, ensuring that you haven't found any record matching those conditions.

SELECT distinct T1.ID 
FROM Table T1 
     LEFT JOIN Table T2 on T2.ID = T1.ID AND 
                     T2.IteamNumber in (132,434,675) AND T2.DateCreated < '2019-01-01'
WHERE T1.IteamNumber in (132,434,675) AND Year(T1.DateCreated) = 2019 AND
      T2.ID is null

UPDATE: Here is the proposal updated with your real query. Since your subquery has inner joins, I have created a CTE so you can left join that subquery. The functioning is the same, you left join your main query with the subquery and you return only the rows with no matching records found on the subquery.

with previous as (
  Select x.No_
  from [Line] c    
       inner join [Header] a on a.CollectionNo = c.CollectionNo
       inner join [Customer] x on x.No_ = a.CustomerNo
  where     c.No_ in ('2101','2102','2103','2104','2105')
        and Enrollmentdate < '2014-01-01'
        and (a.Resignationdate < '1754-01-01 00:00:00.000' OR a.Resignationdate > '2014-12-31'))
)
Select count(distinct b.No_),'2014'
from [Line] c    
     inner join [Header] a on a.CollectionNo = c.CollectionNo
     inner join [Customer] b on b.No_ = a.CustomerNo
     left join previous p on p.No_ = b.No_
where    c.No_ in ('2101','2102','2103','2104','2105')
     and year(Enrollmentdate)= 2014 
     and (a.Resignationdate < '1754-01-01 00:00:00.000' OR a.Resignationdate >= '2014-12-31')
     and p.No_ is null

And one more question, do you know if in statement performs better with integers in it than nvarchar/varchar values? — pancake, Aug 12 '19 at 08:13
@MishMish, if the field is indexed I don't think that it would matter much if it's an integer or an string, but if it's not indexed then yes, the engine would compare integers faster than strings. — Marc Guillot, Aug 12 '19 at 09:20
Is it simple to do an index for the fields? it's in a stored procedure and i believe that it's bad to have index implemented into a stored procedure — pancake, Aug 12 '19 at 10:25
I just edited my question with the actual code, could you please show me how it's done with so multiple joins — pancake, Aug 12 '19 at 10:48
@MishMish, I have updated the answer, you now need to use a CTE so you can have multiple joins on the subquery. — Marc Guillot, Aug 13 '19 at 09:46

NOT IN statement is slowing down my query

3 Answers3