Indexes with included columns, what's the difference?

Question

I've never really understood the difference between these two indexes, can someone please explain what the difference is (performance-wise, how the index structure will look like in db, storage-wise etc)?

Included index

CREATE NONCLUSTERED INDEX IX_Address_PostalCode  
ON Person.Address (PostalCode) 
INCLUDE (AddressLine1, AddressLine2, City, StateProvinceID);

'Normal' index

CREATE NONCLUSTERED INDEX IX_Address_PostalCode  
ON Person.Address (PostalCode, AddressLine1, AddressLine2, City, StateProvinceID);

What do you not understand in the documentation? https://msdn.microsoft.com/en-us/library/ms190806.aspx. — Gordon Linoff, Jan 22 '17 at 13:23
Given the documentation I could basically create included indexes for all my queries, reducing the need for interpreter to ever do any table scans. I guess there are drawbacks with this and I'm looking for some clarifications. — dadde, Jan 22 '17 at 13:30
. . Those are called covering indexes for the query in question. They can greatly improve performance of *select* queries. They do require more space and slow down data modifications. — Gordon Linoff, Jan 22 '17 at 13:31
@GordonLinoff Slow down data modifications? You mean UPDATES? Why, please explain more :) — dadde, Jan 22 '17 at 13:32
I guess I want some kind of best practice guidelines on when to use included columns and when to avoid them + some general information on storage aspects and performance — dadde, Jan 22 '17 at 13:44
UPDATEs become more expensive in the presence of more indexes, because when a column value changes, SQL has to update all copies of it. The more indexes that reference a column, the more copies there are to update. — Mark Sowul, Sep 30 '19 at 14:30

Disillusioned · Accepted Answer · 2017-01-22T14:31:41.723

The internal storage of indexes uses a B-Tree structure and consists of "index pages" (the root and all intermediate pages) and "index data pages" (the leaf pages only).

Note do not confuse "index data pages" with the "data pages" (leaf pages of clustered indexes) which store most of the columns of actual data.

Only the index columns are stored on the index pages.
By placing some columns in the INCLUDE section, less data per index key is stored on each page.
Meaning fewer pages are needed to hold the index keys. (Making it easier to cache these frequently used pages in memory for longer.)
And possibly fewer levels in the tree. (In such a case performance benefits can be much bigger because every tree level traversal is another disk access.)

When an index is used, the index key is used to navigate through the index pages to the correct index data page.

If the index has INCLUDE columns, that data is immediately available should the query need it.
If the query requires columns not available in either the index keys or the INCLUDE columns, then an additional "bookmark lookup" is required to the correct row in the clustered index (or heap if no clustered index defined).

Some things to note that hopefully addresses some of your confusion:

If the keys of your index and filters in your query are not selective enough, then the index will be ignored (regardless of what's in your INCLUDE columns).
Every index you create has overhead for INSERT and UPDATE statements; more so for "bigger" indexes. (Bigger applies to INCLUDE columns as well.)
So while you could in theory create a multitude of big indexes with include columns to match all the permutations of access paths: it would be very counter-productive.

It's worth noting that before INCLUDE columns were added as a feature:

It was a common index tuning 'trick' to expand the keys of an index to include columns that weren't needed in the index/filter. (Known as a covering index.)
These columns were commonly required in output columns or as reference columns for joins to other tables.
This would avoid the infamous "bookmark lookups", but had the disadvantage of making the index 'wider' than strictly necessary.
In fact very often the earlier columns in the index would already identify a unique row meaning the extra included columns would be completely redundant if not for the "avoiding bookmark lookups" benefit.
INCLUDE columns basically allow the same benefit more efficiently.

NB Something very important to point out. You generally get zero benefit out of INCLUDE columns in your indexes if you're in the lazy habit of always writing your queries as SELECT * .... By returning all columns you're basically ensuring a bookmark lookup is required in any case.

I think first line is quite misleading. There is nothing called `index data pages`. Non clustered will never have `data pages`. Both `Key` and `include column` will stored in Index pages only — Pரதீப், Jan 22 '17 at 14:25
@Prdp I did ponder that term a fair bit as I was doing the write-up. I wanted to make the distinction between _normal index pages_ and those that have the additional include column data. Which is why I settled on _index data pages_. — Disillusioned, Jan 22 '17 at 14:29
I know this is old, but is it safe to say that it is always better to use the INCLUDE option (included index) instead of including all the columns in the index key (normal index)? — Edgar Allan Bayron, Jun 21 '18 at 15:04

Pரதீப் · Answer 2 · 2017-01-22T14:44:00.903

In first Index, in Index page only PostalCode is the key column and AddressLine1, AddressLine2, City, StateProvinceID are part of leaf node to avoid key/RID lookup

I will prefer first index when my table will be filtered always on PostalCode and any of this columns AddressLine1, AddressLine2, City, StateProvinceID will be part of select and not filtration

select AddressLine1, AddressLine2, City, StateProvinceID
from Person.Address 
Where PostalCode=

In second index, in Index page there will be five key columns PostalCode, AddressLine1, AddressLine2, City, StateProvinceID

I will prefer second index when I have possiblity to filter data like

Where PostalCode = And AddressLine1 =

or

Where PostalCode = And AddressLine2 =

or

Where PostalCode = And AddressLine1  = and AddressLine2 =

and so on..

At any case the first column in index should be part of filtration to utilize the index

`select *` is not really a good example because as soon as the `select` returns any columns not in the index key or includes: a bookmark lookup would be required. — Disillusioned, Jan 22 '17 at 14:43

score 2 · Answer 3 · edited Jun 20 '20 at 09:12

In the first example, only the index column: PostalCode is stored in the index tree with all the other columns stored in leaf level of the index. This makes the index smaller in size and is useful if you wouldn't be using a where, Join, group by against the other columns but only against the PostalCode.

In the second index, all the data for all the columns are stored in the index tree, this makes the index much bigger but is useful if you would be using any of the columns in a WHERE/JOIN/GROUP BY/ORDER By statements.

Include columns makes it faster to retrieve the data when they are specified in the select list.

For example if you are running:

SELECT PostalCode, AddressLine1, AddressLine2, City, StateProvinceID 
FROM Person.Address 
Where PostalCode= 'A1234'

This will benefit from creating an index on PostalCode and including all the other columns

On the other hand, if you are running:

SELECT PostalCode, AddressLine1, AddressLine2, City, StateProvinceID 
FROM Person.Address 
Where PostalCode= 'A1234' or City = 'London' or StateProvinceID = 1 or AddressLine1 = 'street A' or AddressLine2 = 'StreetB'

This would benefit more from having all the columns in the index

Have a look at the links below, these might help more with your query

Index with Included Column: https://msdn.microsoft.com/en-us/library/ms190806(v=sql.105).aspx

Table and Index Organization: https://msdn.microsoft.com/en-us/library/ms189051(v=sql.105).aspx

In 1st para: if `PostalCode` is _selective_ enough (i.e. narrows down to a small number of rows) it can still be more efficient than scanning the table, so the index can still be useful in the other cases you mention. `JOIN` has another consideration: if used as a reference to lookup into other tables, then `INCLUDE` is great! If used to filter from other tables, then existing key column selectivity comes into play and might benefit adding to index key. Finally second query is likely to table scan regardless of index because of `OR` (perhaps you meant `AND`?). — Disillusioned, Jan 22 '17 at 14:56
Fully agree with all your comments, the examples I have put there were just to highlight the differences between the two scenarios raised by the OP without going into too much details about index uses etc. Thank you for adding the extra info though, sure would be useful if someone looks at the question :) — user7415753, Jan 22 '17 at 19:11

Indexes with included columns, what's the difference?

3 Answers3

Linked