16

I have a SQL table and I would like to select multiple rows by ID. For example, I would like to get the rows with IDs 1, 5 and 9 from my table.

I have been doing this with a WHERE IN statement similar to below:

SELECT [Id]
FROM [MyTable]
WHERE [Id] IN (1,5,9)

However, this is quite slow for large numbers of items in the IN clause.

Below is some performance data from selecting rows using WHERE IN from a table with 1,000,000 rows:

Querying for 1 random keys (where in) took 0ms
Querying for 1000 random keys (where in) took 46ms
Querying for 2000 random keys (where in) took 94ms
Querying for 3000 random keys (where in) took 249ms
Querying for 4000 random keys (where in) took 316ms
Querying for 5000 random keys (where in) took 391ms
Querying for 6000 random keys (where in) took 466ms
Querying for 7000 random keys (where in) took 552ms
Querying for 8000 random keys (where in) took 644ms
Querying for 9000 random keys (where in) took 743ms
Querying for 10000 random keys (where in) took 853ms

Is there a faster way than using WHERE IN to do this?

We can't do a join as this is between disconnected systems.

I have heard that an in-memory temp table joined to the data may be faster in MySQL, but from my research MSSQL doesn't have an in-memory table option, and even so, wouldn't inserting into the temp table be prone to exactly the same index scan as the WHERE IN has?

EDIT:

This table has Id as its primary key, so it has the default PK index; see:

CREATE TABLE [dbo].[Entities](
    [Id] [int] IDENTITY(1,1) NOT NULL,
 CONSTRAINT [PK_dbo.Entities] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Execution plan

(execution plan screenshot omitted; per the comments below, the query runs entirely against the index)

Here is a gist for a console app which produces these performance results: https://gist.github.com/lukemcgregor/5914774

EDIT 2: I created a function which builds a temp table from a comma-separated string, and then joined against that table. It's faster, but I think mostly because of the overhead of parsing the WHERE IN query.
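
The split function itself isn't shown in the question; a minimal sketch of one common way to write it (the name SplitInts and the XML-based shredding are my own illustration, not the OP's actual code) is:

CREATE FUNCTION [dbo].[SplitInts] (@list NVARCHAR(MAX))
RETURNS TABLE
AS
RETURN
(
    -- turn '1,5,9' into an XML fragment and shred it into one row per value
    SELECT x.i.value('.', 'INT') AS [value]
    FROM (SELECT CAST('<i>' + REPLACE(@list, ',', '</i><i>') + '</i>' AS XML) AS doc) AS src
    CROSS APPLY src.doc.nodes('/i') AS x(i)
);
GO

-- join the split keys against the PK on Entities
SELECT e.[Id]
FROM [dbo].[Entities] e
INNER JOIN [dbo].[SplitInts]('1,5,9') ids ON e.[Id] = ids.[value];

(On SQL Server 2016 and later, the built-in STRING_SPLIT function can replace a hand-rolled splitter like this.)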

Querying for 1 random keys took 1ms
Querying for 1000 random keys took 34ms
Querying for 2000 random keys took 69ms
Querying for 3000 random keys took 111ms
Querying for 4000 random keys took 143ms
Querying for 5000 random keys took 182ms
Querying for 6000 random keys took 224ms
Querying for 7000 random keys took 271ms
Querying for 8000 random keys took 315ms
Querying for 9000 random keys took 361ms
Querying for 10000 random keys took 411ms
  • You do have an index on Id right? – Dale M Jul 03 '13 at 00:25
  • As Dale M. suggests, an index on Id is pretty much the first thing you need. Second, take a look at the query plan and verify that it's only touching the index, not the underlying table or even worse, does a table scan on the underlying table. – Timo Geusch Jul 03 '13 at 00:32
  • I second the two comments above, however it's hard to tell what you're trying to do. Maybe if you give a broader picture people will be able to make more specific suggestions. – Vlad G. Jul 03 '13 at 00:52
  • @dalem yes, it should be indexed, see edits – undefined Jul 03 '13 at 01:09
  • @TimoGeusch I've added in the execution plan; it's 100% in the index – undefined Jul 03 '13 at 01:10
  • Are you really just selecting ID and does your table really only have ID? The reason I ask is, the way you described it, this is probably the best you can get. However depending on the answer to the question above there may be a faster way. Also how many IDs are you typically querying at a time? – Vlad G. Jul 03 '13 at 01:15
  • Also, is there a pattern in the ID values or are they completely random? – Vlad G. Jul 03 '13 at 01:17
  • @VladG. in my test application, yes, the table only has ID; in the real world I am wanting more data. I am selecting batches of between 100 and 5000 rows in normal operation of these queries. The IDs in the above test are entirely random, and I've included the code I used to produce these numbers in a gist. – undefined Jul 03 '13 at 01:34

3 Answers

11

OK so I got it going really fast by defining a table type and then passing that type directly into the query and joining onto it.

In SQL:

CREATE TYPE [dbo].[IntTable] AS TABLE(
    [value] [int] NULL
)
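
As a quick sanity check, the type can be exercised straight from T-SQL before wiring up the client code (a hypothetical test snippet, not part of the original answer):

DECLARE @ids [dbo].[IntTable];
INSERT INTO @ids ([value]) VALUES (1), (5), (9);

SELECT e.*
FROM [dbo].[Entities] e
INNER JOIN @ids ids ON e.[Id] = ids.[value];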

In code:

// Assumes toSelect (the collection of IDs to fetch), con (an open SqlConnection)
// and results (a List<int>) are declared elsewhere, as in the linked gist.
DataTable dataTable = new DataTable("mythang");
dataTable.Columns.Add("value", typeof(Int32));

// Copy the requested IDs into the DataTable that backs the table-valued parameter.
toSelect.ToList().ForEach(selectItem => dataTable.Rows.Add(selectItem));

using (SqlCommand command = new SqlCommand(
    @"SELECT * 
    FROM [dbo].[Entities] e 
    INNER JOIN @ids on e.id = value", con))
{
    // Pass the DataTable as a structured (table-valued) parameter of type IntTable.
    var parameter = command.Parameters.AddWithValue("@ids", dataTable);
    parameter.SqlDbType = System.Data.SqlDbType.Structured;
    parameter.TypeName = "IntTable";

    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            results.Add(reader.GetInt32(0));
        }
    }
}

This produces the following results:

Querying for 1 random keys (passed in table value) took 2ms
Querying for 1000 random keys (passed in table value) took 3ms
Querying for 2000 random keys (passed in table value) took 4ms
Querying for 3000 random keys (passed in table value) took 6ms
Querying for 4000 random keys (passed in table value) took 8ms
Querying for 5000 random keys (passed in table value) took 9ms
Querying for 6000 random keys (passed in table value) took 11ms
Querying for 7000 random keys (passed in table value) took 13ms
Querying for 8000 random keys (passed in table value) took 17ms
Querying for 9000 random keys (passed in table value) took 16ms
Querying for 10000 random keys (passed in table value) took 18ms
  • Nice work. Before you posted this I thought you were testing it in T-SQL. This is a nice alternative to using BCP. Good luck – andre.barata Jul 03 '13 at 04:33
  • Nice! In the real app depending on how wide your table will be the bottleneck may be the random IO generated by the random ID seeks. SQL Server may insert a Sort on the TVP before loop-joining it to Entities to minimize the randomness. You can avoid the expense of that sort if you define a clustered PK in your TVP (on ID); see the sketch after these comments. I would also experiment with data compression for Entities to minimize page count and improve buffer pool efficiency (again, if row size warrants that). Compression is only available in Enterprise Edition. – Vlad G. Jul 03 '13 at 04:33
  • If I'm understanding this, what you're doing is writing the keys you want to a temporary table, joining that table to the original from which you want the rows, and then reading out the join table? – Narfanator Oct 04 '18 at 19:20
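
Picking up on Vlad G.'s suggestion above, a hypothetical variant of the table type with a clustered primary key would look like the following (the name IntTablePk is my own, not from the answer; the client code would then set parameter.TypeName = "IntTablePk"):

CREATE TYPE [dbo].[IntTablePk] AS TABLE(
    [value] [int] NOT NULL PRIMARY KEY CLUSTERED
)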
3

I guess if you joined your table with an in-memory table (a table variable) indexed by a primary key, such as:

declare @tbl table (ids int primary key)

you could fill this table with the IDs you need and perform an optimized inner join.

The problem could be the time it would take to fill it. I guess you could either have a linked server for that, or maybe use the BCP utility to fill a temporary table and then delete it.
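
Filled in, the approach would look roughly like this (a sketch of my own, reusing the Entities table from the question; in practice the keys would be bulk-loaded rather than listed inline):

declare @tbl table (ids int primary key)

insert into @tbl (ids) values (1), (5), (9)

select e.[Id]
from [dbo].[Entities] e
inner join @tbl t on e.[Id] = t.ids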

andre.barata
  • I'll give this a crack and get some stats on performance numbers for this method – undefined Jul 03 '13 at 01:13
  • Or use table-valued parameters, then you don't have to build the table variable and insert the set of rows into it. Unlikely to make a difference in either case, if the index seek is still the bottleneck, but the TVPs should have an edge. – Aaron Bertrand Jul 03 '13 at 01:29
2

First, I think it is a stretch to claim that your data is suggestive of O(n log(n)). (It is great that you did the performance test, by the way.) Here is the time per value (in ms per key):

1000    0.046
2000    0.047
3000    0.083
4000    0.079
5000    0.078
6000    0.078
7000    0.079
8000    0.081
9000    0.083
10000   0.085

Although there is a slight increase as n goes up, the jump from 2000 to 3000 is much, much more prominent. If this is reproducible, the question to me is why there is such a discontinuity.

To me, this is more suggestive of O(n) than of O(n log(n)). BUT, empirical estimates of theoretical values are difficult to approximate, so the exact limit is not so important.

I would expect performance to be O(n) (where n is the actual value and not the bit-length, as it is in some estimates). My understanding is that IN behaves like a giant set of ORs. Most records fail the test, so they have to do all the comparisons. Hence the O(n).

The next question is whether you have an index on the id field. In that case, you can get the set of matching ids in O(n log(n)) time (log(n) for traversing the index and n for doing it for each value). This seems worse, but we have left out the factor for the size of the original table. This should be a big win.

As Andre suggests, you can load the ids into a temporary table and do a join. I would leave out the index on the temporary table, because you are probably better off using the index on the larger table. This should get you O(n log(n)) -- with no (significant) dependency on the size of the original table. Or, without the index on the larger table, you have O(n * m), where m is the size of the original table. I think any index built on the temporary table gets you back to O(n log(n)) performance (assuming the data is not presorted).
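
For illustration, a bare temp-table version of that join (a sketch of my own, against the Entities table from the question; the temp table is a plain heap with no index, as suggested above):

CREATE TABLE #ids ([value] INT NOT NULL);

INSERT INTO #ids ([value]) VALUES (1), (5), (9);  -- in practice, bulk-load the keys

SELECT e.[Id]
FROM [dbo].[Entities] e
INNER JOIN #ids i ON e.[Id] = i.[value];

DROP TABLE #ids;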

Placing everything in a query has a similar, unstated problem -- parsing the query. This takes longer as the string gets longer.

In short, I commend you for doing performance measurements, but not for coming to conclusions about algorithmic complexity. I don't think your data supports your conclusion. Also, the handling of queries is a bit more complicated than you suggest, and you have left out the size of the larger table -- which can have a dominant effect. And I'm quite curious what is happening between 2000 and 3000 rows.

Gordon Linoff
  • I think you are absolutely right about the algorithmic complexity now that I see the per-row numbers (I really just took a guess at complexity based on it appearing to be non-linear); I will update that in the question. I think the reason for the spike is probably to do with optimisations around query planning (see: http://stackoverflow.com/questions/8635818/multiple-insert-statements-vs-single-insert-with-multiple-values). The larger table has 1,000,000 rows in it (it's actually in the question). – undefined Jul 03 '13 at 01:26