
I've been trying to find the best way to run certain queries in SQL that could be written several different ways. In my research I've come across quite a lot of criticism of the WHERE ... IN construct, due to an inherent inefficiency in how it works.

eg: WHERE Col IN (val1, val2, val3)

In my current project, I'm doing an UPDATE on a large set of data and am wondering which of the following is more efficient (or whether a better option exists):

UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (id1, id2, id3 ....);

In the above, the list can contain up to 1.5k IDs.

VS

Looping through all ID's in code, and running the following statement for each:

UPDATE table1 SET somecolumn = 'someVal' WHERE ID = 'theID';

To me, it seems logical that the former would work better/faster, because there are fewer queries to run. That said, I'm not 100% familiar with the ins and outs of SQL and how query queueing works.

I'm also unsure which would be friendlier on the DB in terms of table locks and general performance.

General info in case it helps, I'm using Microsoft SQL Server 2014, and the primary development language is C#.

Any help is much appreciated.

EDIT:

Option 3:

UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (SELECT ID FROM @definedTable);

In the above, @definedTable is a SQL Server user-defined table type, whose data comes into a stored procedure as (in C#) type SqlDbType.Structured.

People are asking how the IDs come in: the IDs are in a List&lt;string&gt; in the code, and are used for other things before being sent to a stored procedure. Currently, the IDs come into the stored procedure as a user-defined table type with a single column (the IDs).

I thought having them in a table might be better than having the code concatenate a massive string and spit it into the SP as a variable that looks like id1, id2, id3, id4 etc.
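For context, here's a minimal sketch of what the SQL side of that setup might look like (the type, procedure, and column names here are placeholders, not the real ones):

```sql
-- Placeholder names; the real type/procedure names differ.
CREATE TYPE dbo.IdList AS TABLE (ID nvarchar(50) NOT NULL PRIMARY KEY);
GO

CREATE PROCEDURE dbo.UpdateSomeColumn
    @definedTable dbo.IdList READONLY
AS
BEGIN
    UPDATE table1
    SET somecolumn = 'someVal'
    WHERE ID IN (SELECT ID FROM @definedTable);
END
GO
```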

Sean Missingham
  • Did you try looking at the execution plans to see which of the two queries is faster? – japzdivino Oct 19 '15 at 01:09
  • Where are id1, id2, id3 coming from? In most practical cases they come from another table, as the result of filtering on some condition. In that case you'd be better off joining with that table, to get decent performance. – Andrew Savinykh Oct 19 '15 at 01:10
  • How are you passing these IDs in to SQL? Do you have a list of values in your C# code, or do you get them as the result of another SQL query? – DavidG Oct 19 '15 at 01:10
  • For 1.5k IDs, it might be worth storing them in a (temporary) table first, then doing `WHERE ID IN (SELECT ID FROM temptable)` – Colonel Thirty Two Oct 19 '15 at 01:10
  • I attempted to test it myself using SQL Profiler and by looking at general execution time in SQL Management Studio, but because it's happening through a stored procedure in my case, I didn't get the results I wanted. I think I'll have to write a little script to run it outside of the SP and watch it in Profiler. All that said, I'm still very new to the MS SQL tools, so it's a slow process for me at the moment haha – Sean Missingham Oct 19 '15 at 01:11
  • @SeanMissingham you should be able to see it in SQL Profiler regardless of whether it's in or out of a stored procedure. Perhaps you did not set up the trace correctly. – Andrew Savinykh Oct 19 '15 at 01:15
  • @zespri Very likely that setup isn't correct, but that's great to know. I'll go down that path instead of writing a dedicated script just to test this. – Sean Missingham Oct 19 '15 at 01:16
  • @SeanMissingham a dedicated script is great for showing others; Profiler is usually faster for finding the problem yourself. – Andrew Savinykh Oct 19 '15 at 01:18
  • @zespri &amp; DavidG I've made an EDIT to the main post to answer your question. It's actually what Colonel Thirty Two suggested. I didn't think to distinguish between SELECT FROM a table vs. a list of IDs. – Sean Missingham Oct 19 '15 at 01:18
  • Having them in a table is definitely better. You can only have about 2,000 parameters, and once you have more than that, your SP would break. But there's no such problem with a UDT. – Andrew Savinykh Oct 19 '15 at 01:21
  • Wow, so you reckon (edited) option 3 is best, using a UDT? As a newbie, I don't really see how having them as rows in a table is any faster than having them in a list, hence why I didn't distinguish in the original post. – Sean Missingham Oct 19 '15 at 01:23

4 Answers


I'm using your third option and it works great.

My stored procedure has a table-valued parameter. See also Use Table-Valued Parameters.

In the procedure there is one statement, no loops, like you said:

UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (SELECT ID FROM @definedTable);

It is better to call the procedure once, than 1,500 times. It is better to have one transaction, than 1,500 transactions.

If the number of rows in the @definedTable goes above, say, 10K, I'd consider splitting them into batches of 10K.
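A rough, untested C# sketch of that call with batching (dbo.UpdateSomeColumn and dbo.IdList are assumed names for the stored procedure and the user-defined table type; requires System.Data, System.Data.SqlClient, and System.Linq):

```csharp
// Untested sketch; procedure and table-type names are assumptions.
static void UpdateInBatches(SqlConnection conn, IReadOnlyList<string> ids,
                            int batchSize = 10000)
{
    for (int offset = 0; offset < ids.Count; offset += batchSize)
    {
        // Build a one-column DataTable matching the table type
        var table = new DataTable();
        table.Columns.Add("ID", typeof(string));
        foreach (var id in ids.Skip(offset).Take(batchSize))
            table.Rows.Add(id);

        using (var cmd = new SqlCommand("dbo.UpdateSomeColumn", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            var p = cmd.Parameters.AddWithValue("@definedTable", table);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.IdList"; // the user-defined table type
            cmd.ExecuteNonQuery();
        }
    }
}
```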


Your first variant is OK for a few values in the IN clause, but when you get to really high numbers (60K+) you may see something like this, as shown in this answer:

Msg 8623, Level 16, State 1, Line 1
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

Vladimir Baranov
  • "If the number of rows in the @definedTable goes above, say, 10K, I'd consider splitting it in batches of 10K." How do you split it? – Black Apr 15 '19 at 10:18
  • @Black, if I had, say, 30K rows to update, I'd call the stored procedure three times and pass 10K IDs in the table-valued parameter on each call. Actually, nowadays with modern hardware and networks, I'd only start bothering with batches at maybe 100K rows or more. – Vladimir Baranov Apr 15 '19 at 13:57

Your first or third options are the best way to go. For either of them, you want an index on table1(id).

In general, it is better to run one query rather than multiple queries, because the overhead of passing data in and out of the database adds up. In addition, each update starts a transaction and commits it -- more overhead. That said, this will probably not be important unless you are updating thousands of records. The overhead is measured in hundreds of microseconds or milliseconds on a typical system.

Gordon Linoff

You should definitely NOT use a loop that sends an entire new SQL statement for each ID. In that case, the SQL engine has to recompile the statement, come up with an execution plan, etc., every single time.

Probably the best thing to do is to prepare a statement with a placeholder, then loop through your data, executing the statement for each value. The statement then stays in the database engine's memory, and it quickly executes with each new value you pass rather than starting from scratch every time.

If you have a large database and/or run this often, also make sure you create an index on that ID column; otherwise every lookup will be a full table scan.
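For example, something like this (the index name is arbitrary, and the table/column names are taken from the question):

```sql
CREATE INDEX ix_table1_id ON table1 (ID);
```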

EDIT:

Perl pseudocode (untested), as described below:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# AutoCommit off: all updates run in a single transaction
my $dbh = DBI->connect('dbi:Oracle:MY_DB', 'scott', 'tiger',
    { RaiseError => 1, PrintError => 1, AutoCommit => 0 });

# Prepare once; each execute reuses the same statement handle
my $sth = $dbh->prepare('UPDATE table1 SET somecolumn = ? WHERE id = ?');

my @updatetuples = ();  # filled elsewhere with [id, new_value] pairs
foreach my $tuple (@updatetuples) {
    $sth->execute($tuple->[1], $tuple->[0]);  # (value, id)
}

$dbh->commit;
$sth->finish;
$dbh->disconnect;
exit(0);
Laserbeak
  • I thought the data was coming from your program, not another table. It's not entirely clear. But in that case, I still think this is the best route. I don't know C#, so here is some untested Perl pseudocode... oh, I guess I have to put that in an answer. – Laserbeak Oct 19 '15 at 11:20

I came upon this post when trying to solve a very similar problem, so I thought I'd share what I found. My answer uses the CASE keyword, and it applies when you are trying to run an update for a list of key-value pairs (not when you are updating a bunch of rows to a single value). Normally I'd just run an update query joining the relevant tables, but I am using SQLite rather than MySQL, and SQLite doesn't support joined update queries as well as MySQL does. You can do something like this:

UPDATE mytable
SET somefield = (CASE WHEN id = 100 THEN 'some value 1'
                      WHEN id = 101 THEN 'some value 2'
                 END)
WHERE id IN (100, 101);

Joey Rich