
In my scenario, I have posts, which are grouped into categories. For an overview list of the categories I want to display a summary of the top 10 posts of each category (as opposed to the detail view of a category, which displays the full data). The top 10 posts are determined by a score that comes from another table (actually an indexed view - but that doesn't matter here).

The table structure is the following:

CREATE TABLE [dbo].[Categories]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Categories] PRIMARY KEY,
    [Key] CHAR(10) CONSTRAINT [UK_Categories_Key] UNIQUE,
    [Caption] NVARCHAR(500) NOT NULL,
    [Description] NVARCHAR(4000) NULL
)
GO

CREATE TABLE [dbo].[Posts]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Posts] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Posts_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [Key] CHAR(10) CONSTRAINT [UK_Post_Key] UNIQUE,
    [Text] NVARCHAR(4000) NULL,
    [SummaryText] AS
        CASE WHEN LEN([Text]) <= 400
            THEN CAST([Text] AS NVARCHAR(400))
            ELSE CAST(SUBSTRING([Text], 1, 399) + NCHAR(8230) AS NVARCHAR(400)) --First 399 characters plus an ellipsis (SUBSTRING is 1-based)
        END
        PERSISTED
)
GO

CREATE TABLE [dbo].[Scores] (
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Scores] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Scores_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [PostId] INT NOT NULL CONSTRAINT [FK_Scores_Post] FOREIGN KEY REFERENCES [dbo].[Posts] ([Id]),
    [Value] INT NOT NULL
)
GO

CREATE INDEX [IX_Scores_CategoryId_Value_PostId]
    ON [dbo].[Scores] ([CategoryId], [Value] DESC, [PostId])
GO

I can now use a view to get the top ten posts of each category:

CREATE VIEW [dbo].[TopPosts]
AS
SELECT c.Id AS [CategoryId], cp.PostId, p.[Key], p.SummaryText, cp.Value AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.PostId, s.Value
    FROM [dbo].[Scores] s
    WHERE s.CategoryId = c.Id
    ORDER BY s.Value DESC
) AS cp
INNER JOIN [dbo].[Posts] p ON cp.PostId = p.Id

I understand that the CROSS APPLY will use the covering index IX_Scores_CategoryId_Value_PostId, because it contains the category ID (for the WHERE), the value (for the ORDER BY and the SELECT), and the post ID (for the SELECT), and will thus be reasonably fast.

The question is now: what about the INNER JOIN? The join predicate uses the post ID, which is the key of the Post table's clustered index (the primary key). When I create a covering index that includes all the fields of the SELECT (see below), can I significantly increase query performance (with a better execution plan, reduced I/O, index caching etc.), even though accessing the clustered index is already a pretty fast operation?

The covering index would look like this:

CREATE INDEX [IX_Posts_Covering]
    ON [dbo].[Posts] ([Id], [Key], [SummaryText])
GO

UPDATE:

Since the direction of my question doesn't seem entirely clear, let me lay out my thoughts in more detail. I am wondering whether the covering index (or an index with included columns) could be faster for the following reasons (and whether the performance gain would be worth it):

  1. Hard drive access. The second index would be considerably smaller than the clustered index, so SQL Server would have to go through fewer pages on disk, which would yield better read performance. Is that correct, and would you see the difference?
  2. Memory consumption. To load the data into memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?
  3. CPU. My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the columns from the row is not per se an expensive CPU operation. Correct?
  4. Caching. My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

These are basically (more or less educated) assumptions. I would appreciate it a lot if someone could enlighten me about this admittedly very specific issue.

Sefe
  • Let SSMS show you an actual execution plan for both options (with and without the additional index), and you will see at once whether a) the optimizer will pick the index for the JOIN and b) there is a significant performance gain on your version of SQL Server. My guess is yes, since the automatic proposals of the index optimization wizard often contain that kind of index in scenarios like yours. – Cee McSharpface Nov 14 '16 at 13:18
  • The execution plan would show me _if_ it uses the index. Whether it is _worth_ using the index will depend on the amount of data in the tables. I would like to avoid running into performance problems in the production DB. So if you (or someone else) have experience with a similar scenario, it would help me a lot. – Sefe Nov 14 '16 at 13:24
  • See @dlatikay's comment above; that is how you get the answer you seek, sir. – Neo Nov 14 '16 at 13:24
  • @Sefe that's why indexes are not always, and not necessarily, a static part of a database schema. adding it will have a tradeoff: query speed/insert speed/update speed, and storage size, especially when you include payload (`SummaryText`). in doubt, don't create it right now but get feedback from the DBA of the production database and have them create the index if the empirically obtained execution plan suggests it. – Cee McSharpface Nov 14 '16 at 13:32
  • @dlatikay: Yes, it is common practice to set up your index based on current performance and empirical data. We're doing that too. I would like to go further than that and try to _understand_ what is the better solution. The better I understand the DBMS, the better I can design a DB. I will gladly go the empirical way, but I like to know why SQL Server behaves a certain way. Will a covering index reduce I/O on the table? Will it use less memory on the DB server? Will it cache better? Etc. – Sefe Nov 14 '16 at 19:32
  • @Sefe is this question sufficiently answered for you? – usr Nov 20 '16 at 13:50
  • @usr: I am keeping this question open, since my aim is to understand the inner workings of SQL Server. I know it's a nerdy question but I am still hoping for some more insight. The current answers are already good though, and I have upvoted them. – Sefe Nov 20 '16 at 18:54
  • Right, just trying to find out if there are issues unanswered. – usr Nov 20 '16 at 19:07

3 Answers


This is a fun question because all four sub-questions you raise can be answered with "it depends", which is usually a good sign that the subject matter is interesting.

First of all, if you have an unhealthy fascination with how SQL Server works under the covers (like I do), the go-to source is "Microsoft SQL Server Internals" by Delaney et al. You don't need to read all ~1000 pages; the chapters on the storage engine are interesting enough on their own.

I won't touch the question of whether this particular covering index is useful in this particular case, because I think the other answers have covered that nicely (no pun intended), including the recommendation to use INCLUDE for columns that don't need to be indexed themselves.

The second index would be considerably smaller than the clustered index, SQL Server would have to go through less pages on the HD, which would yield better read performance. Is that correct and would you see the difference?

If you assume the choice is either between reading pages of the clustered index or pages of the covering index, the covering index is smaller¹, which means less I/O, better performance, all that niceness. But queries don't execute in a vacuum -- if this is not the only query on the table, the buffer pool may already contain most or all of the clustered index, in which case disk read performance could be negatively affected by having to read the less-frequently used covering index as well. Overall performance may also be decreased by the total increase in data pages. The optimizer considers only individual queries; it will not carefully tune buffer pool usage based on all queries combined (dropping pages happens through a simple LRU policy). So if you create indexes excessively, especially indexes that are used infrequently, overall performance will suffer. And that's not even considering the intrinsic overhead of indexes when data is inserted or updated.

Even if we assume the covering index is a net benefit, the question "would you see the difference" (as in, does performance measurably increase) can only be effectively answered empirically. SET STATISTICS IO ON is your friend here (as well as DBCC DROPCLEANBUFFERS, in a test environment). You can try and guess based on assumptions, but since the outcome depends on the execution plan, the size of your indexes, the amount of memory SQL Server has in total, I/O characteristics, the load on all databases and the query patterns of applications, I wouldn't do this beyond a ballpark guess of whether the index could possibly be useful. In general, sure, if you have a very wide table and a small covering index, it's not hard to see how this pays off. And in general, you will sooner see bad performance from not enough indexes than from too many indexes. But real databases don't run on generalizations.
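For instance, a minimal measuring session could look like this (a sketch; the view name is the one from the question, and DBCC DROPCLEANBUFFERS should only ever be run in a test environment):

```sql
-- Compare logical/physical reads with and without the covering index.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;           -- empty the buffer pool (test environment only!)
SELECT * FROM [dbo].[TopPosts];  -- run once without the index, note "logical reads"

-- ...create the covering index, then repeat:
DBCC DROPCLEANBUFFERS;
SELECT * FROM [dbo].[TopPosts];  -- compare the reads reported for [Posts]
```

The difference in the "logical reads" figures for [dbo].[Posts] between the two runs is the most direct measure of whether the index pays off for this query.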

To load the data into the memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?

See above. The clustered index takes up more pages than the covering index, but whether memory usage is affected positively or negatively depends on how each index is used. In the very worst case, the clustered index is used intensively by other queries that don't profit from your covering index, while the covering index is only of help to a rare query, so all the covering index does is cause buffer pool churn that slows down the majority of your workload. This would be unusual and a sign your server could do with a memory upgrade, but it's certainly possible.

My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the row from the columns is not per se a CPU operation. Correct?

CPU usage is typically not measurably affected by row size. Execution time is (and that, in turn, does affect usage depending on how many queries you want to run in parallel). Once you've covered the I/O bottleneck by giving your server plenty of memory, there's still the matter of scanning the data in memory.

My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

Rows are stored on pages, and SQL Server caches the pages it reads in the buffer pool. It does not cache result sets, or any intermediate data generated as part of the query execution, or individual rows. If you execute a query twice on an initially empty buffer pool, the second one is typically faster because the pages it needs are already in memory, but that's the only source of speedup.

With that in mind, see the answer to your first question -- yes, caching is affected because the pages of your covering index, if used, are cached separately from the pages of the clustered index, if used.
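If you want to watch this happen, the buffer pool contents can be broken down per index with the standard DMVs (a sketch; requires the VIEW SERVER STATE permission):

```sql
-- How many buffer pool pages does each index of dbo.Posts currently occupy?
SELECT i.name AS index_name, COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions AS p
    ON au.container_id = p.hobt_id
   AND au.type IN (1, 3)         -- in-row and row-overflow data
JOIN sys.indexes AS i
    ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID(N'dbo.Posts')
GROUP BY i.name;
```

Run it before and after executing the view, and you can see exactly which index's pages get pulled into (and pushed out of) the cache.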


¹ A covering index may not actually be smaller if it's heavily fragmented due to page splits. But this is an academic point, because it's not really about which index is physically larger but how many pages of each are actually accessed.

Jeroen Mostert
  • A clustered index is not a separate index; it makes sure the table itself is organized as a B-tree, while a nonclustered index is saved in separate pages. Correct? – shawn Nov 22 '16 at 04:01
  • @shawn: clustered and nonclustered indexes are very similar. The difference is that the clustered index is also, in effect, "the table" -- it's not *just* an index but the container of all row data, whereas nonclustered indexes only contain their index keys (plus included columns) and the clustered index keys as a pointer. MSDN has a nice [article](https://technet.microsoft.com/library/ms189051) that explains it. – Jeroen Mostert Nov 22 '16 at 06:58
  • Of course things are not as easy as you wish them to be. I thought that, as always with query tuning, there is no easy answer. Anyway, my original problem has vanished, since now I am not joining a table, but an indexed view (another one), so there is another clustered index that is reasonably small. – Sefe Nov 22 '16 at 07:06

No, you do not need this covering index.

Limit the number of indexes for each table: A table can have any number of indexes. However, the more indexes there are, the more overhead is incurred as the table is modified. Thus, there is a trade-off between the speed of retrieving data from a table and the speed of updating the table.

Your scenario is more likely an OLTP system than a data warehouse, so it will have a large number of online transactions (insert, update, delete). Creating this covering index will therefore slow down your modification operations.

Update:

Yes, there will be 10 posts per category. So if you have N categories, the returned result set contains at most 10 * N post records.

Another guideline about indexes: create an index if you frequently want to retrieve less than 15 percent of the rows in a large table (my SQL tuning instructor suggests 5 percent). If you retrieve more than 15 percent, an execution plan that uses the index will not be optimal.

Let's consider two extreme cases for your Posts table:

  1. The Posts table has just 10 * N records, and every category has exactly 10 posts. In that case the final execution plan will do a full scan of the Posts table instead of using any index.
  2. The Posts table has more than (10 * N / 15%) rows, so the query retrieves less than 15% of the rows in the Posts table. The optimizer will then use the post ID field to do the join operation, and it should be a hash join.

So even if you have created a covering index, the optimizer will never use it unless you force it with a hint.
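If you want to test that claim, you can force the index with a table hint and compare the resulting plan against the default one (a sketch; the index name is the one proposed in the question, and the post ID is made up):

```sql
-- Force the covering index; compare this plan and its IO with the unhinted query.
SELECT p.[Key], p.[SummaryText]
FROM [dbo].[Posts] AS p WITH (INDEX ([IX_Posts_Covering]))
WHERE p.[Id] = 42;   -- hypothetical post ID
```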

Updated:

Clustered and Nonclustered Indexes Described

shawn
  • Thanks for the reply. I am aware of the read/write performance tradeoff. In this case, the data is read way more often than it is written. And while I agree to keep the number of indexes low, this operation is one of the most used in the system (if not _the_ most used), so it's probably worth paying the price of the additional index's size (it's almost at the size limit of 900 bytes). But maybe someone has practical experience in a similar scenario. And there will be way more than 10 accesses to the posts table. There will be 10 posts _per each category_. – Sefe Nov 14 '16 at 18:54
  • @Sefe thanks for pointing out my mistake. Yes, there will be 10 posts per each category. And I have updated my post – shawn Nov 15 '16 at 13:56
  • Sorry, I don't want to be a pain, but I still don't get it. Why would the Optimizer prefer a clustered index scan/seek over a seek of the index? Even though the index is fairly large, it still produces less IO than a clustered index seek. It won't even have to touch the table when it can traverse the index, so why would it? – Sefe Nov 15 '16 at 20:29
  • @Sefe Clustered indexes make sure the rows of the table are stored in the same order as the index describes, so a table can have only one clustered index. In the end you retrieve data from the POST table, not from the index itself. This link is about clustered and nonclustered indexes: https://msdn.microsoft.com/en-us/library/ms190457.aspx – shawn Nov 18 '16 at 01:18
  • @Sefe I hope this link helps you: http://stackoverflow.com/questions/1251636/what-do-clustered-and-non-clustered-index-actually-mean – shawn Nov 18 '16 at 01:36
  • I am aware of the differences between clustered and non-clustered indices. And it is not correct that it will access the clustered index in any case. That's the whole point of a covering index: to avoid table access. The question here is whether it is _considerably_ cheaper in this specific case to access a covering index instead of the clustered index. – Sefe Nov 18 '16 at 06:29
  • @Sefe Yes, the covering index will be useful in this case. – shawn Nov 18 '16 at 06:56
  • It's complicated. Keep in mind that extra overhead due to a missing index takes resources from other areas; for example, it might require holding locks a bit longer, which in turn holds up inserts and can eventually escalate into other things. That's why everyone recommends lots of baselines and measurements, and only applying minimal changes while measuring the outcome. – Nick.Mc Nov 19 '16 at 01:26

Your nonclustered covering index might give you a nominal added performance benefit over the clustered index, but it is going to depend on the size of the data you are querying. If the number of rows is relatively small, then there will likely be no useful advantage.

Taking a step back, given that your join predicate uses only [Posts].[Id], adding the [Key] and [SummaryText] columns as key columns in the index is unnecessary. They should instead be added as nonkey columns:

CREATE NONCLUSTERED INDEX [IX_Posts_Covering]
    ON [dbo].[Posts] ([Id])
    INCLUDE ([Key], [SummaryText])
GO

Per Microsoft: MSDN - Create Indexes with Included Columns

Redesign nonclustered indexes with a large index key size so that only columns used for searching and lookups are key columns. Make all other columns that cover the query into nonkey columns. In this way, you will have all columns needed to cover the query, but the index key itself is small and efficient.

Include nonkey columns in a nonclustered index to avoid exceeding the current index size limitations of a maximum of 16 key columns and a maximum index key size of 900 bytes. The Database Engine does not consider nonkey columns when calculating the number of index key columns or index key size.

Essentially, the covering index makes a duplicate of the [dbo].[Posts] table excluding the [CategoryId] and [Text] columns. Because you will have fewer columns in the covering index, SQL should be able to stuff in more rows per index page. Based on that assumption (which, admittedly, may need scrutiny), as SQL traverses the b-tree, seeking across pages to find the matching rows, it might perform nominally better on the covering index because it has fewer pages to load and look through.
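That assumption is easy to verify: sys.dm_db_index_physical_stats reports the leaf-level page count and average record size for each index (a sketch; run it in the database that contains the tables):

```sql
-- Compare page counts and record sizes of the clustered vs. the covering index.
SELECT i.name AS index_name,
       ps.page_count,
       ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Posts'), NULL, NULL, 'DETAILED') AS ps
JOIN sys.indexes AS i
    ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE ps.index_level = 0;      -- leaf level only
```

The ratio of the two page counts tells you roughly how many more rows per page the covering index holds.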

Regardless of the index choice, you might also consider placing your join to the [Posts] table into the cross apply. That would likely force a seek, though the makeup of your data would determine the efficiency.

CREATE VIEW [dbo].[TopPosts]
AS
SELECT c.[Id] AS [CategoryId], cp.[PostId], 
    cp.[Key], cp.[SummaryText], cp.[Value] AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.[PostId], s.[Value], p.[Key], p.[SummaryText]
    FROM [dbo].[Scores] s
    INNER JOIN [dbo].[Posts] p ON s.[PostId] = p.[Id]
    WHERE s.[CategoryId] = c.[Id]
    ORDER BY s.[Value] DESC
) AS cp

At the end of the day, it is going to depend on the size of your data, disk IO, RAM, etc. You will have to decide if the additional space used by the covering index will justify the nominal performance gain, if any.

A great breakdown of index usage: https://dba.stackexchange.com/a/42568/2916

uhleeka
  • Thanks for the reply. I've considered using INCLUDEs. I have spent the last half hour trying to find an article where someone ran performance tests and the covering index yielded better performance than an index with INCLUDEs. I couldn't find it though. My point is that with the large NVARCHAR column in the table a clustered index seek might have to go through many more pages than with the covering index. That's why I am considering the option of the covering index. – Sefe Nov 18 '16 at 21:38
  • @sefe: To your first point about key vs nonkey (INCLUDE), per the MSDN design recommendations, https://msdn.microsoft.com/en-us/library/ms190806.aspx, "Redesign nonclustered indexes with a large index key size so that only columns used for searching and lookups are key columns. Make all other columns that cover the query into nonkey columns. In this way, you will have all columns needed to cover the query, but the index key itself is small and efficient." – uhleeka Nov 19 '16 at 00:04
  • @sefe: Regarding your point about the large NVARCHAR, I agree in principle that the clustered index will have to go through more pages than the covering index. But it seems like it is only something you would consider implementing if you are doing very fine tuning of your query. IMO, the caching, RAM, I/O, fragmentation, statistics, etc. will all be significant factors once you are at that level of tuning. – uhleeka Nov 19 '16 at 00:15