Composite Indexes, the “Include” Keyword, and How They Work

Question

In SQL Server (and most other relational databases), a "Composite Index" is an index with multiple keys. Let's say we have this query that gets run a lot, and we want to create a covering index for this query to speed it up;

SELECT a, b FROM MyTable WHERE c = @val1 AND d = @val2

These are all possible composite indexes that would cover this query;

CREATE INDEX ix1 ON MyTable (c, d, a, b)
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)
CREATE INDEX ix3 ON MyTable (d) INCLUDE (a, b, c)
CREATE INDEX ix4 ON MyTable (c) INCLUDE (a, b, d)

But apparently, they don't perform equally. According to Erlan Sommarskog (Microsoft MVP), the first two are faster than the 3rd and 4th, and the 4th is faster than the 3rd.

He goes on to explain;

ix2 is the "best" index, because a and b will not take up space in the higher levels of the index tree. Also, if a or b are updated, in ix2 there can be no page splits or similar as the index tree is unaffected.

However, I am having a hard time grasping what exactly is going on. I do have the general knowledge on b-tree indexes and how they work, but I don't understand the logic behind composite keys. For example;

CREATE INDEX ix1 ON MyTable (c, d, a, b)

Does the order of the columns here matter? If so, why? Also;

CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)

What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.

Note: I know there are a lot of posts on Composite Keys, but I believe my last two questions are specific enough to not be a duplicate.

The columns indexed are those used to create the b-tree, so in your first example all 4 columns are part of the index. Included columns just means in addition to the columns which are indexed, so additional data is *included* at the leaf level which avoids needing to do a key lookup to get the information not in the index. So for the example you've given you'll have a slightly faster lookup, because the b-tree consists of only 2 columns, and a fast return of data because no key lookup is required. — Dale K, Feb 08 '23 at 07:41
You'd see this much clearer if you setup these examples and inspect the execution plans. — Dale K, Feb 08 '23 at 07:43
@DaleK thank you for your comment, I understand the "point" of composite keys, I just am trying to understand how exactly they work. Could you answer my questions? — Emre Bener, Feb 08 '23 at 07:57
@BogdanSahlean I don't understand what you mean by "Sorted columns are only Key" — Emre Bener, Feb 08 '23 at 07:58
I don't know what you mean by "how they work"? But you say you understand how a b-tree works? What is it you don't understand? — Dale K, Feb 08 '23 at 08:00
If you write some *select* queries that return the columns in the "keys" and "includes" column and *sort* by the *keys* you see exactly *what* the index looks like. The index *keys* are sorted, the included columns are not. If the table has 100 rows and `c` is a unique number from 1-100 then having `d` in the index is of no use since it's onle beneficial as a tie-breaker for duplicate values of `c` which there wouldn't be. — Stu, Feb 08 '23 at 08:42

score 1 · Answer 1 · edited Feb 08 '23 at 08:04

CREATE INDEX ix1 ON MyTable (c, d, a, b)

Does the order of the columns here matter? If so, why? Also;

Yes, order is very important while creating index, because each column is (from left) next level of deepness in index, so to determine the compilator to use this index you need always seek for c which is the "opener" of this set.

CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b)

What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.

But keep in mind that for each level of the index it starts to be less efficient, so if you know that > 80% of your queries will only seek by c & d and not a & b, but you will need that information in your SELECT (nor in WHERE) you should INCLUDE them, as part of the leaf at the last level of the index.

There are better explanations than mine so feel free to look at them:

INCLUDE equivalent in Oracle -> INCLUDE How important is the order of columns in indexes? -> ORDER in INDEX set

score 1 · Accepted Answer · answered Feb 08 '23 at 11:24

Does the order of the columns here matter?

Considering only the query in your question with 2 equality predicates, the order of the composite index key columns doesn't matter as long as both are the leftmost key columns of the composite index. Any of the covering indexes below will optimize this query:

CREATE INDEX ix1 ON MyTable (c, d, a, b);
CREATE INDEX ix2 ON MyTable (c, d) INCLUDE (a, b);
CREATE INDEX ix3 ON MyTable (d, c, a, b);
CREATE INDEX ix4 ON MyTable (d, c, b, a);
CREATE INDEX ix5 ON MyTable (d, c) INCLUDE (a, b);

That said, the stats histogram contains only the leftmost index key column so the general guidance is to specify the most selective column first to improve row count estimates and execution plan quality. This consideration is more important for non-trivial queries where the optimizer has many choices and row count estimates are an important factor in choosing the best plan.

Another consideration for key order, which may conflict with the above general guidance, is when the index supports different queries and only some of the key columns are specified (e.g. SELECT a, b FROM MyTable WHERE d = @val2;). In that case, it would be better to specify d as the leftmost column regardless of selectivity in order to allow a single index to optimize multiple queries instead of creating a separate index to optimize the second query.

What is the difference between this composite key and the one above? I don't understand what difference "INCLUDE" makes.

Included columns are not key columns. Key columns are maintained in logical order at every level throughout the b-tree whereas included columns are present only in the b-tree leaf nodes and not ordered. Consequently, the specified order of included columns does not matter. The only purpose of included columns is to help cover queries without adding them as key columns and incurring the associated overhead.

thanks Dan, you cleared up a lot of question marks! if you don't mind, i'd like to as a follow-up question. how would you decide whether to use ix1 or ix2 to cover the query in question? to my understanding, ix1 would provide faster selects, but it would take more space on the disc than ix2. also, ix1 would make the inserts and updates slower because there are more indexes that need to be synched. so, to me ix2 seems to be the best option as it offers a good balance. are my points correct? — Emre Bener, Feb 08 '23 at 13:27
@Lilith, for the query in your question, either ix1 or ix2 will provide the same select performance. The advantage of ix2 included columns vs ix1 key columns is avoiding overhead of maintaining the additional columns in the b-tree during inserts, updates, deletes plus slightly less space. I would go with idx2 if you don't have other queries that would benefit by idx1 (e.g. also column a in the WHERE clause). — Dan Guzman, Feb 08 '23 at 13:41

Composite Indexes, the “Include” Keyword, and How They Work

2 Answers2