91

I have a query doing something like:

SELECT FieldX, FieldY FROM A
WHERE FieldW IN (108, 109, 113, 138, 146, 160,
307, 314, 370, 371, 441, 454, 457, 458, 479, 480,
485, 488, 490, 492, 519, 523, 525, 534, 539, 543,
546, 547, 550, 564, 573, 629, 642, 643, 649, 650,
651, 694, 698, 699, 761, 762, 768, 772, 773, 774,
775, 778, 784, 843, 844, 848, 851, 852, 853, 854,
855, 856, 857, 858, 859, 860, 861, 862, 863, 864,
865, 868, 869, 871, 872, 873, 891) 

Is an IN clause with so many options bad for query performance? I'm experiencing many timeouts in my application, and I believe it could be a source of this kind of problem. Can I optimize the query without removing the numbers, using some good SQL hint?

EDIT:

@KM these are keys in a different table. This is a forum application; explaining briefly: C# gets all forums from the database and stores them in the app cache. Before C# calls a procedure that gets the threads for these forums and for this user, it does some logic filtering the "all forums" collection, considering permissions and some business logic. The timeout happens on the database and not in the application itself. Doing all this logic in the query would require a lot of inner joins, and I'm not 100% sure I can do all of it inside the procedure.

I'm using SQL Server 2000

NoWar
Victor Rodrigues
  • Are these random numbers, or do they come from somewhere? Maybe you can make a separate table, put these numbers in there, and index it, then use a join in your select statement. – Stan R. Jun 18 '09 at 16:43
  • No, they aren't random numbers. There are some queries executed before (and then cached) to return these ids. Making a join is rather difficult, because of some application logic being done today in C# after doing that (cached) query. – Victor Rodrigues Jun 18 '09 at 16:48
  • you don't give much info, but if these (IN n, n,n ..) values are keys in a different table, and have something in common (status=xyz), it may be possible to INNER JOIN to that table which could be faster. If you are selecting all these IDs (in a different query) based on some condition and then building this select statement from that result set, then you should try to use the INNER JOIN method instead – KM. Jun 18 '09 at 16:53
  • From SQL server's perspective, however, they are random, and it has to query each individually. With no pattern I don't know that you can write a good hint. You could try multithreading - break the numbers up into smaller clumps and send more queries with smaller lists. – quillbreaker Jun 18 '09 at 16:56
  • I'm not sure what you mean by "cached". I suspect something unusual may be causing locking to take place, and the queries are timing out on locks. – dkretz Jun 18 '09 at 17:07
  • @OP, tell us if you have an index on the field; if you do, I must assume the reason for the timeouts lies somewhere else... – tekBlues Jun 18 '09 at 17:12
  • We're getting an amazing amount of guesses and suggestions here that will imply a lot of restructuring and testing with little basic evidence to go on. I suggest you take a break and see if we can get some basic information first. Are you sure this is the timeout query? Is it locking, or is it working the entire time? Can we see a query plan? For starters. – dkretz Jun 18 '09 at 17:40
  • @le dorfier, No, I'm not sure this is the timeout query; it is just a piece of a larger query, passed as a parameter (NText) to the real procedure, which runs it among others. Since there are no obvious heavy SQL calls, I'm trying to optimize some queries. That's why I've questioned whether this IN could be one more reason for the query slowing down. – Victor Rodrigues Jun 18 '09 at 19:02
  • From what you've described so far, I doubt it's a good place to be spending much time improving the situation. I don't think any of this will be much help without more information about the larger context, and is likely just to make it all more complicated. We could work it through, but in this thread it will just get lost in the noise of all the guesses. Sorry. I'd like to help, because this is one of the more interesting situations that's been posted. – dkretz Jun 18 '09 at 20:20
  • Similar Question: http://stackoverflow.com/questions/761150/how-does-the-in-predicate-work-in-sql – Gavin Miller Jun 18 '09 at 20:42
  • Thanks, now can you set the following: SET STATISTICS IO ON and SET STATISTICS TIME ON and run the same query and post what is displayed for statistics. – dkretz Jun 18 '09 at 22:48
  • @Victor, you say this is a forum application. Is there a url where I can sign up and log on and discuss the details of this? Your description of your design leaves a bunch of questions in my mind. – dkretz Jun 18 '09 at 23:00
  • If these are "keys in a different table" then it sounds like you're pulling them out with one query to construct another query rather than just joining things together. If that's true, it's undoubtedly unproductive. But I may be misunderstanding. – dkretz Jun 18 '09 at 23:02

14 Answers

186

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') OR (COL = 'c'). The execution plan for both queries will likely be equivalent assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
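
On SQL Server, for example, sp_executesql lets you keep the query text constant and pass the values as parameters (a minimal, single-value sketch; a variable-length list would need one parameter per value or a different technique such as a temp table):

EXEC sp_executesql
    N'SELECT FieldX, FieldY FROM A WHERE FieldW = @w',
    N'@w INT',
    @w = 108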

Third, many databases have a limit on the complexity of queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen, in which case the database will simply reject the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimizations do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally, though, queries that use the UNION ALL operator are trivially parallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.
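
As a sketch, the UNION ALL form of the question's predicate would look like this (only the first two values shown, one branch per value):

SELECT FieldX, FieldY FROM A WHERE FieldW = 108
UNION ALL
SELECT FieldX, FieldY FROM A WHERE FieldW = 109
-- ... and so on for each value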

zcoop98
LBushkin
  • Thanks for this comment, this is something everyone needs to know. I ran into an IN statement that was being passed 25k items; got to love legacy code. :D – vikingben Apr 04 '13 at 20:31
  • "The execution plan for both queries will likely be equivalent assuming that you have an index on col." So if the column is unindexed, is there a performance difference between the syntaxes? Which one performs better on an unindexed column? Why is that? – jumxozizi Sep 22 '16 at 15:36
  • Can the order of the IN parameters have an effect on performance? For example, should I sort the numbers before writing them in the IN clause? – Maxter Dec 03 '19 at 16:27
  • If the column has some kind of range, then there is a good explanation here: https://dba.stackexchange.com/questions/207255/performance-mysql-id-vs-id-in-multiple-values – Ashish Shetkar Dec 24 '20 at 08:45
  • What if we use an inner join instead of the IN operator? Will it improve performance? – zulqadar idrishi Mar 20 '21 at 05:26
7

You can try creating a temporary table, inserting your values into it, and using the table in the IN predicate instead.

AFAIK, SQL Server 2000 cannot build a hash table from a set of constants, which deprives the optimizer of the possibility of using a HASH SEMI JOIN.

This will help only if you don't have an index on FieldW (which you should have).
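
A minimal sketch of that approach, using the question's own column names (the temp table name #Ids is illustrative; SQL Server 2000 needs one INSERT per value):

CREATE TABLE #Ids (Id INT NOT NULL PRIMARY KEY)

INSERT INTO #Ids (Id) VALUES (108)
INSERT INTO #Ids (Id) VALUES (109)
-- ... one INSERT per value

SELECT FieldX, FieldY
FROM A
WHERE FieldW IN (SELECT Id FROM #Ids)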

You can also try to include your FieldX and FieldY columns into the index:

CREATE INDEX ix_a_wxy ON a (FieldW, FieldX, FieldY)

so that the query can be answered from the index alone.

SQL Server 2000 lacks the INCLUDE option for CREATE INDEX; the wider key may degrade DML performance a little, but it will improve the query performance.
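
For reference only, since the question targets 2000: on SQL Server 2005 and later, the same covering effect can be achieved without widening the index key:

CREATE INDEX ix_a_w ON a (FieldW) INCLUDE (FieldX, FieldY)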

Update:

From your execution plan I see that you need a composite index on (SettingsID, SectionID).

SQL Server 2000 can indeed build a hash table out of a constant list (and does so), but a Hash Semi Join will most probably be less efficient than Nested Loops for your query.

And just a side note: if you need to know the count of rows satisfying the WHERE condition, don't use COUNT(column), use COUNT(*) instead.

A COUNT(column) does not count the rows for which the column value is NULL.

This means that, first, you can get results you didn't expect, and, second, the optimizer will need to do an extra Key Lookup / Bookmark Lookup if your column is not covered by an index that serves the WHERE condition.

Since ThreadId seems to be a CLUSTERED PRIMARY KEY, it's all right for this very query, but try to avoid it in general.
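
To illustrate the difference (SomeColumn stands in for any nullable column):

-- counts every row satisfying the WHERE condition
SELECT COUNT(*) FROM A WHERE FieldW = 108

-- counts only the rows where SomeColumn IS NOT NULL
SELECT COUNT(SomeColumn) FROM A WHERE FieldW = 108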

Quassnoi
  • I would love to see someone test this assertion, i.e. testing the performance of IN vs creating a temporary table and joining... – tekBlues Jun 18 '09 at 16:57
  • @tekBlues: don't have a 2000 handy, sorry. 2005 builds a hash table over the IN clause values all right, using a CONSTANT SCAN method. Could you please build the execution plan for your query and post it here? – Quassnoi Jun 18 '09 at 16:59
  • @tekBlues: oh, sorry, didn't notice you're not the @op. Never mind :) – Quassnoi Jun 18 '09 at 16:59
  • Heck of a lot of work just to identify which queries are timing out. Why not just look at a trace? – dkretz Jun 18 '09 at 17:01
  • @le dorfier: I'd like to, but I don't have a 2000 installed. – Quassnoi Jun 18 '09 at 17:02
  • @Q, Suggestion to OP, since that's the problem trying to solve. – dkretz Jun 18 '09 at 17:03
  • Yes, just tested it, and it uses a Clustered Index Scan regardless of the number of values in the IN, SQL 2000. – tekBlues Jun 18 '09 at 17:10
  • @tekBlues: a Clustered Index Scan is a full table scan. Could you please see what join method it uses? – Quassnoi Jun 18 '09 at 17:19
  • @tekBlues - we have used the approach of creating a temporary table for this exact purpose. In order to be performant, you must insert rows into the temp table and then update statistics on the temp table - otherwise the SQL execution plan optimizer thinks the temp table is empty and will perform a full scan. This approach can indeed be faster than a query using IN or OR clauses because it can avoid the steps of parsing the query and building an execution plan. – LBushkin Jun 18 '09 at 17:33
  • Creating this temp table, inserting N rows into it, updating its statistics, and then joining the temp table in my query - could all this be faster than the IN clause? – Victor Rodrigues Jun 18 '09 at 18:52
  • @Victor: Could you please post the current execution plan for your query? – Quassnoi Jun 18 '09 at 21:59
5

If you have a good index on FieldW, using that IN is perfectly right.

I have just tested and SQL 2000 does a Clustered Index Scan when using the IN.

tekBlues
  • Then that wouldn't necessarily be a good thing. It should be doing lookups rather than a scan, suggesting that using IN isn't "perfectly right". But size of table, cardinality, and other factors are also important. – dkretz Jun 18 '09 at 17:15
  • @tekBlues: could you please see if it does a Hash Match over the Constant Scan? Just add OPTION (HASH JOIN) to the end of the query and see the plan. – Quassnoi Jun 18 '09 at 17:16
  • @Quassnoi, I added the HASH JOIN and the execution plan doesn't change. – tekBlues Jun 18 '09 at 17:19
  • Ideally you would like your query to search the index specifically for each value, rather than reading through the entire index from beginning to end ("scan"). If there is just one key, or the keys are an ordered set, it's more likely to do that. But if the keys aren't in a table (ordered, or easily orderable) the query optimizer may do something suboptimal. Also, it's clear that you've got those values in the (single possible) clustered index, which may or may not be true for the OP, and may or may not even be significant. – dkretz Jun 18 '09 at 17:27
  • @tekBlues: does the Clustered Table Scan have an arrow pointing to the box with a join / semijoin method used: Nested Loops, Merge or Hash? – Quassnoi Jun 18 '09 at 17:27
  • There's no likelihood that @tekBlues' table is the same size, schema, cardinality, indexing, etc. The query plans are likely entirely different. – dkretz Jun 18 '09 at 17:35
  • @le dorfier: that's why I asked to force the join method – Quassnoi Jun 18 '09 at 17:42
  • Then you're solving @tekBlues problem, not OP's problem which may be entirely different. Doesn't it seem to you that this question has gone off the rails? – dkretz Jun 18 '09 at 17:44
  • @le dorfier: what I want to see is whether SQL Server 2000 is capable of building a hash table over a set of constants or not. It's crucial for understanding the problem, since browsing 73 values in a loop is far less efficient than scanning a hash table. – Quassnoi Jun 18 '09 at 17:46
  • @le dorfier: the @op asked to improve this query, and I have the strong feeling it can be improved. You're right, there may (or may not) be other issues with his system, but the query can be improved too. – Quassnoi Jun 18 '09 at 17:49
  • The actual problem is that the application is timing out, possibly on this query, maybe from slowness, maybe from locking. That's a long way from the ability to build a hash table. Wouldn't you at least like to see a query plan first? The query is only worth improving once we know it's a problem. – dkretz Jun 18 '09 at 17:55
  • @Quassnoi and all you guys: I've posted the Estimated Execution Plan for the query on the main post. – Victor Rodrigues Jun 18 '09 at 20:41
  • "I have just tested and SQL 2000 does a Clustered Index Scan when using the IN." Did you try/compare `FieldW=108 or FieldW=109 or FieldW=113...`? – jumxozizi Sep 22 '16 at 15:41
5

Depending on your data distribution, additional predicates in your WHERE clause may improve performance. For example, if the set of ids is small relative to the total number in the table, and you know that the ids are relatively close together (perhaps they will usually be recent additions, and therefore clustered at the high end of the range), you could try including the predicate "AND FieldW BETWEEN 108 AND 891" (after determining the min & max id in your set in the C# code). It may be that doing a range scan on those columns (if indexed) works faster than what is currently being used.
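
A sketch of what that would look like, with the bounds taken from the min and max of the question's id set:

SELECT FieldX, FieldY
FROM A
WHERE FieldW BETWEEN 108 AND 891
  AND FieldW IN (108, 109, 113 /* , ... */, 891)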

Steve Broberg
3

There are better ways to code it, but I doubt it's the cause of your timeouts, especially if it's only a SELECT. You should be able to determine that by looking at your query traces though. But recoding this would be optimization by guessing, and an unlikely guess at that.

Let's start with a query plan for the query that is actually timing out. Do you know for sure which query it is?

dkretz
2

IN is exactly the same thing as writing a big list of ORs. And OR often makes queries unSARGable, so your indexes may be ignored and the plan goes for a full scan.

Remus Rusanu
1

Typically the IN clause is harmful to performance, but what is "bad" depends on the application, data, database size, etc. You need to test your own app to see what is best.

Bryan Migliorisi
  • Hi Bryan, what do you mean by "harmful to performance"? The scenario is that I want to filter certain values for a field. What's the best way to do it? IMHO it is using the IN clause. – tekBlues Jun 18 '09 at 16:47
1

Basically what that where clause does is "FieldW = 108 OR FieldW = 109 OR FieldW = 113...". Sometimes you can get better performance by doing multiple selects and combining them with UNION. For example:

SELECT FieldX, FieldY FROM A WHERE FieldW = 108
UNION ALL
SELECT FieldX, FieldY FROM A WHERE FieldW = 109

But of course that is impractical when you're comparing to so many values.

Another option might be to insert those values into a temporary table and then join the A table to that temp table.

Tommi
  • I would be wary of using UNION statements, especially in this circumstance. Effectively, a UNION performs the equivalent of a SELECT DISTINCT on the final result set. In other words, UNION takes the results of two like recordsets, combines them, and then performs a SELECT DISTINCT in order to eliminate any duplicate rows. You'd be running an exponential number of SELECTs under the covers. – Adam McKee Jun 18 '09 at 16:56
1

Performance can only be judged in the context of what you are trying to do. In this case you are requesting the retrieval of around 70 rows (assuming they are unique values), so you can expect something like 70 times the duration of retrieving a single value. It might be less due to caching, of course.

However, the query optimiser may need or choose to perform a full table scan in order to retrieve the values, in which case performance will be little different from retrieving a single value via the same access plan.

David Aldridge
1

I generally would use a user-defined table type for queries like this.

CREATE TYPE [dbo].[udt_int] AS TABLE (
    [id] [int] NOT NULL
)

Declaring a table variable of that type and filling it with a row for each of your numbers, you can do:

DECLARE @myIds dbo.udt_int

-- fill the table variable with the ids (multi-row VALUES requires SQL Server 2008+)
INSERT INTO @myIds (id) VALUES (108), (109), (113) -- ...

SELECT
    FieldX,
    FieldY
FROM A
INNER JOIN @myIds B ON
    A.FieldW = B.id
Donald.Record
  • This will give you worse performance. With 100 identities in a @myIds table, a heavy query that takes 5 seconds to execute using IN (SELECT id FROM @myIds) degrades to 12 seconds using an INNER JOIN in SQL Server 13. – Patrik Melander Dec 05 '19 at 17:37
0

The size of your table will determine the speed of this statement. If it's not a very large table, this statement isn't affecting your performance.

Eric
0

Here is your answer...

https://web.archive.org/web/20211020153409/https://www.4guysfromrolla.com/webtech/031004-1.shtml

Basically, you want to create a function that will split a string and populate a temp table with the split contents. Then you can join to that temp table and manipulate your data. The article above explains things pretty well. I use this technique a lot.

In your specific case, use a join to the temp table instead of an IN clause; it is much faster.
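
A minimal sketch of such a split function for SQL Server 2000 (the name dbo.fn_SplitInts and the VARCHAR(8000) input limit are assumptions; the linked article has a fuller version):

CREATE FUNCTION dbo.fn_SplitInts (@List VARCHAR(8000))
RETURNS @Ids TABLE (Id INT NOT NULL)
AS
BEGIN
    DECLARE @Pos INT
    WHILE LEN(@List) > 0
    BEGIN
        SET @Pos = CHARINDEX(',', @List)
        IF @Pos = 0
        BEGIN
            -- last (or only) value in the list
            INSERT INTO @Ids (Id) VALUES (CAST(@List AS INT))
            SET @List = ''
        END
        ELSE
        BEGIN
            INSERT INTO @Ids (Id) VALUES (CAST(LEFT(@List, @Pos - 1) AS INT))
            SET @List = SUBSTRING(@List, @Pos + 1, 8000)
        END
    END
    RETURN
END

It can then be joined like a table:

SELECT a.FieldX, a.FieldY
FROM A a
INNER JOIN dbo.fn_SplitInts('108,109,113') s ON s.Id = a.FieldW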

infocyde
0

You might try something like:

select a.FieldX, a.FieldY
from (
    select FieldW = 108 union
    select FieldW = 109 union
    select FieldW = 113 union
    ...
    select FieldW = 891
) _a
join A a on a.FieldW = _a.FieldW

It may be appropriate for your situation, such as when you want to generate a single SQL statement dynamically. On my machine (SQL Server 2008 Express), testing with a small number (5) of FieldW values and a large number (100,000) of rows in A, this uses an index seek on A with a nested loops join between A and _a, which is probably what you're looking for.

yfeldblum
-1

If you can use something other than IN, do it. (In some cases I was using IN when it wasn't really the right way; I could easily replace it with EXISTS, and it was quicker.)

In your case, it doesn't seem so bad.
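
As a sketch of the EXISTS variant (the Forums table and ForumId column here are hypothetical; this only applies if the ids can be derived from another table rather than passed in as literals):

SELECT a.FieldX, a.FieldY
FROM A a
WHERE EXISTS (SELECT * FROM Forums f WHERE f.ForumId = a.FieldW)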

Hugues Van Landeghem