Explain optimal query for large table with clustered index in SQL Server 2008

Question

I'm working on a very large table (approximately 2.7 million rows added per day), which has the following structure:

CREATE TABLE [dbo].[Result](
    [ResultDate] [date] NOT NULL,
    [Thing1Id] [int] NOT NULL,
    [Num] [int] NOT NULL,
    [Thing2Id] [int] NOT NULL,
CONSTRAINT [PK_Result] PRIMARY KEY CLUSTERED 
(
    [ResultDate] ASC,
    [Thing1Id] ASC,
    [Num] ASC
))

As the clustered primary key is on ResultDate, Thing1Id and Num, I would expect the following query to be optimal:

SELECT Thing2.* 
FROM dbo.Result
INNER JOIN Thing2 ON Thing2.Id = result.Thing2Id
WHERE 
    ResultDate >= '2012-01-01'
    AND
    ResultDate <= '2012-01-30'
    AND Thing1Id = 23

As you can see, the query is finding results in Jan-12 for a particular Thing1.

However, the execution plan indicates that a huge performance increase can be gained by adding the following index:

CREATE NONCLUSTERED INDEX [IX_Missing]
ON [dbo].[Result] ([Thing1Id],[ResultDate])
INCLUDE ([Num],[Thing2Id])

And surely enough, adding this index does improve the performance massively.

Can someone please explain why? As far as I'm concerned the results should be narrowed sufficiently using the clustered primary key, and adding this will make the index size much larger and add unnecessary overhead.

Can I index the table differently to get better performance?

(Please note, in reality the table is actually 2 tables unioned, data is shifted from one to the other daily, and the data is partitioned monthly).

I bet even just adding a non-clustered index on `dbo.Result(Thing2Id)` would speed up your query quite a bit, based on the fact that `Thing2Id` is a foreign key and used in your `INNER JOIN Thing2 ON Thing2.Id = result.Thing2Id` statement... — marc_s, Dec 05 '12 at 13:06
Yes, you're right. I have already added an index on Thing2Id which speeds things up, but I didn't include it in the post because I'm more interested in the clustered index. Thanks. — smartypantsfe, Dec 06 '12 at 10:45
I don't see why an index on Thing2Id would help for this query. Just adding a index because it has an FK can hurt more than it helps on large tables like this. — Wim, Dec 07 '12 at 07:18
Probablby, Thing1id is more selective, or SQL Server is doing some kind of index intersection. Post the execution plans and we will know for sure. — usr, Apr 28 '13 at 14:52
Can you post the [query execution plans](http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan) for both? — Justin, Jun 04 '13 at 14:11

score 0 · Answer 1 · edited Dec 05 '12 at 11:23

0

Indexes basically arrange your table by 'key'. In your case 'thing1ID', 'ResultDate'. When the table is sorted, accessing rows is much faster than looping over whole table (2.7mil) because you don't have a clue where row might be.

i.e. 2,7,3,8,1, you need to search whole table to get to number 1. But if you would have 1, 2, 3, 7, 8. you check just the first number.

BUT! for a table that has many updates/inserts that are involving 'key' will slow down because you need to sort the table after every entry. So figure out what is best for your DB.

edited Dec 05 '12 at 11:23

Dan Ellis

5,537
8
47
73

answered Dec 05 '12 at 10:59

Gustav Klimt

430
3
14

That is exactly why the clustered index is on ResultDate THEN Thing1Id. I'm afraid your comment doesn't help explain why the extra index is necessary. – smartypantsfe Dec 06 '12 at 10:55

score 0 · Answer 2 · answered Dec 05 '12 at 11:36

0

The PK is not optimal for your query because you are doing a range search on ResultDate. With your query you are narrowing the search for Thing1Id 23 to approx. 81 million rows which still is a lot.

In your query the search on Thing1Id is fixed on 23, and therefor the extra index on Thing1Id and ResultDate will be optimal for your query.

answered Dec 05 '12 at 11:36

Wim

1,967
11
19

Thanks Wim. So is this the best way to index the table, or can you suggest a better way? – smartypantsfe Dec 06 '12 at 10:49
It is impossible to say if it is the best way to index the table without knowing exactly the content of the database and all the queries. But it is a good index for this query. I would just create the index, run some performance tests and analyse the results again. – Wim Dec 07 '12 at 07:12

score 0 · Answer 3 · edited May 23 '17 at 11:57

A query execution plan will tell you for certain what is going on here which is normally a lot better than speculating, however in this case I think there is enough information for an educated guess.

Firstly, the INCLUDE ([Num],[Thing2Id]) part of your index just means that the the value of these two columns is duplicated in the index as well as in the table itself. Its useful because it can prevent SQL Server having to look in the table itself for these details after performing a lookup in that index (in which case the index is a covering index) however generally this lookup is pretty quick and so is unlikely to be directly responsible for the "massively" improved performance. I'm going to guess that the following index is 99.9% as fast.

CREATE NONCLUSTERED INDEX [IX_Missing]
ON [dbo].[Result]
(
    [Thing1Id],
    [ResultDate]
)

Before we continue its important to understand that there are two ways for SQL Server to execute this query (vastly simplified for the purposes of the explanation):

Find all of the rows that have ResultDate between the two supplied dates, and then look through those rows for ones that have a Thing1Id of 23
Find all of the rows that have a Thing1Id of 23, and then look through those rows for ones that have ResultDate between the two supplied dates

Depending on what data is present in the table one of these approaches could be significantly quicker than the other, for example if the majority of rows in the table have a Thing1Id of 23 and very few have a matching ResultDate then its probably going to be quicker to use the first approach as it eliminiates more rows sooner.

The other important piece of puzzle that we need to understand is that because of the way that indexes work, SQL can't use your clustered index in the second case because the Thing1Id column comes after the ResultDate column (it would be a bit like asking someone to use the index in a book to find all of the entries whose second letter is "Q" and then ask them to go through and pick out only those words that begin with "S")

Therefore my guess as to why this index improves the performance is simply that it is more efficient for SQL Server to use approach 2 (filter by Thing1Id first) than approach 1.

You should be able to use the Query Execution plans to confirm this.

Explain optimal query for large table with clustered index in SQL Server 2008

3 Answers3