
Which columns should I add to an index so that the following query runs quickly?

select max(id)
from tb
group by BranchId, ArticleID
having count(*) > 1
Uwe Keim
SAM
  • Use SQL Server's [Tuning Advisor](https://learn.microsoft.com/en-us/sql/tools/dta/tutorial-database-engine-tuning-advisor?view=sql-server-2017) on that query. – Uwe Keim Jun 24 '19 at 05:19
  • BranchId and ArticleId. You may read this post for details: https://stackoverflow.com/questions/107132/what-columns-generally-make-good-indexes – TAB Jun 24 '19 at 05:24

2 Answers


Your current query may not benefit from any index, because the count(*) > 1 term may require SQL Server to count every row in each group. But we can rewrite your query as follows, so that it may use an index:

SELECT MAX(id)
FROM tb
GROUP BY BranchId, ArticleID
HAVING MIN(id) <> MAX(id);

Then, add the following index:

CREATE INDEX idx ON tb (BranchId, ArticleID, Id);

The trick here is that count(*) > 1 is logically equivalent to the smallest and largest id values in a group not being the same. Note that I assume here that id is a unique column, i.e. that a given group would never have two or more records with the same id value.
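
As a quick sanity check of that equivalence, here is a minimal sketch against a throwaway temp table whose columns mirror the question; the sample data is hypothetical and not from the question:

--Hypothetical sample data; column names mirror the question's table
CREATE TABLE #tb (id INT PRIMARY KEY, BranchId INT, ArticleID INT);

INSERT INTO #tb (id, BranchId, ArticleID)
VALUES (1, 10, 100),  -- group (10, 100) has two rows
       (2, 10, 100),
       (3, 20, 200);  -- group (20, 200) has one row

--Original form: returns 2, the MAX(id) of the only group with more than one row
SELECT MAX(id) FROM #tb GROUP BY BranchId, ArticleID HAVING COUNT(*) > 1;

--Rewritten form: returns the same row, because id never repeats,
--so COUNT(*) > 1 holds exactly when MIN(id) <> MAX(id)
SELECT MAX(id) FROM #tb GROUP BY BranchId, ArticleID HAVING MIN(id) <> MAX(id);

DROP TABLE #tb;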

Tim Biegeleisen
  • `count(*)` isn't sargable, but are `min(id)` and `max(id)` sargable? – Zohar Peled Jun 24 '19 at 05:33
  • @ZoharPeled Are you asking rhetorically or are you really unsure? SQL Server _should_ be able to use the index to quickly find min/max, since `id` is covered, and therefore the B-tree should be sorted by that column as well, per group. – Tim Biegeleisen Jun 24 '19 at 05:35
  • I really don't know. I've never tested it. I remember reading that using anything other than a "pure" column name in the `where` clause is non-sargable, but then reading that some operations (such as casting datetime to date) are sargable, but I don't recall anything about the `having` clause. – Zohar Peled Jun 24 '19 at 06:00
  • Many thanks, could you please clarify what the difference is between putting the ID field in the key columns of the index vs. putting it in the INCLUDE part? The Tuning Advisor actually suggests the INCLUDE part. – SAM Jun 24 '19 at 07:09
  • I'm not sure you want to put `Id` into the `INCLUDE` clause. AFAIK, `INCLUDE` means the column value is only present in the leaf node. But, in the case of `Id`, we want the B-tree itself to be built using `Id`. – Tim Biegeleisen Jun 24 '19 at 07:17
  • SQL Server won't use the B tree to find the min and max in each group even though theoretically it could. This rewrite is no better than the `COUNT` and an index on `BranchId, ArticleID INCLUDE (Id)` would be – Martin Smith Jun 24 '19 at 07:21
  • @MartinSmith You know more SQL Server than I do. In other databases (e.g. MySQL, Postgres), the database would use the index I suggested. – Tim Biegeleisen Jun 24 '19 at 07:24
  • SQL Server will be able to use the index with leading columns `BranchId, ArticleID` to avoid a sort for a stream aggregate but it won't leverage the fact that the `Id` is sorted within each group to help with the `MIN`/`MAX`. All rows will still flow into the stream aggregate and the aggregates worked out from that. If this is a worthwhile optimisation (e.g. because there are very few groups with large numbers of rows) this would need a rewrite - e.g. with `APPLY` and a correlated seek to get the highest and lowest value per group – Martin Smith Jun 24 '19 at 07:27
  • @MartinSmith so if I understand you correctly, using `max(col)` and `min(col)` is not sargable, but it can be written in a sargable way using `apply` - would that be something [like this](https://rextester.com/JJQW20722)? – Zohar Peled Jun 24 '19 at 08:09
  • @ZoharPeled - hopefully there would be another source of distinct `BranchId, ArticleID` as getting them from grouping on the base table would make the rest pointless - may as well get the `MIN` and `MAX` whilst at it. If there was no other source a recursive CTE could be used to seek the next group along (see the sketch after these comments) – Martin Smith Jun 24 '19 at 08:22
  • @MartinSmith Thanks, but I'm not sure I understand. Can you post an online example of what you mean? – Zohar Peled Jun 24 '19 at 08:39
  • @ZoharPeled - I haven't the time at the moment. But along the lines of "a better algorithm" here https://sqlperformance.com/2014/10/t-sql-queries/performance-tuning-whole-plan – Martin Smith Jun 24 '19 at 09:46
  • As long as one can write a query in a way that she can make the 'searched' columns (columns used in where clause, joins, group by, order by, distinct) sargable, ROWSTORE indexes are fine. But then there are many situations where she might have difficulty in trying to maintain the same or can have a better approach, some of those I have tried to cover in my post below. In many of such cases, COLUMNSTORE index can perform significantly better. – san Jun 25 '19 at 03:32
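
To make Martin Smith's suggestion in the comments above more concrete, here is a hedged sketch of the APPLY-based rewrite. It assumes a hypothetical lookup table dbo.BranchArticle that already stores the distinct (BranchId, ArticleID) pairs; as he notes, if no such table exists, a recursive CTE that seeks from one group to the next would be needed instead.

--Sketch only: dbo.BranchArticle is a hypothetical table with one row per
--distinct (BranchId, ArticleID) pair. With the index (BranchId, ArticleID, Id),
--each correlated MIN/MAX below can be satisfied by a seek to one end of that
--group's range in the B-tree, instead of streaming every row of the group.
SELECT mx.MaxId
FROM   dbo.BranchArticle AS ba
CROSS APPLY (SELECT MIN(t.id) AS MinId
             FROM   tb AS t
             WHERE  t.BranchId = ba.BranchId
               AND  t.ArticleID = ba.ArticleID) AS mn
CROSS APPLY (SELECT MAX(t.id) AS MaxId
             FROM   tb AS t
             WHERE  t.BranchId = ba.BranchId
               AND  t.ArticleID = ba.ArticleID) AS mx
WHERE  mn.MinId <> mx.MaxId;

Whether this pays off depends on the data: as noted in the comments, it helps when there are few groups with many rows each, since each group then costs two seeks instead of a full pass over its rows.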

Apart from what Tim has already covered on ROWSTORE indexes, I want to make a few additional comments here along with a demonstration:

As long as you are able to write a query in a way that makes the 'searched' columns (columns used in the where clause, joins, group by, order by, distinct) sargable (i.e. the query can make use of the indexed columns), ROWSTORE indexes are fine. But there are many situations where you might have difficulty maintaining that, or where you can have a better approach; some of them are:

  1. Having to make the query extra complex just to make sure it can use the ROWSTORE indexes
  2. When you have to use a CAST in your where clause or joins
  3. When you have to use functions in your where clause (but try to avoid this in your code anyway as a best practice)
  4. When you need to use LIKE in your queries (Col1 LIKE '%foo%') (points 2-4 are illustrated in the sketch after this list)
  5. When you need to perform aggregates and GROUP BY
  6. When users sort on different columns all the time
  7. And one most important point: when your 'searched' columns have low cardinality, ROWSTORE indexes do not really help much, but this is where a COLUMNSTORE index can really shine
  8. When your chief wait is PAGEIOLATCH (reading data pages from disk)
  9. When your tables are very wide but report users do not use all the columns
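
To illustrate points 2-4, here is a small sketch contrasting sargable and non-sargable predicates; the table, column, and index assumptions (dbo.Orders with an index leading on OrderDate, dbo.Customers with an index on CustomerName) are hypothetical and only for illustration:

--Non-sargable: wrapping the indexed column in a function (or an implicit CAST)
--hides it from an index seek, so the whole index or table is scanned
SELECT OrderId FROM dbo.Orders WHERE YEAR(OrderDate) = 2019;

--Sargable rewrite: the bare column is compared against a range,
--so an index leading on OrderDate can be seeked
SELECT OrderId FROM dbo.Orders
WHERE OrderDate >= '20190101' AND OrderDate < '20200101';

--Non-sargable: a leading wildcard defeats an index on CustomerName
SELECT CustomerId FROM dbo.Customers WHERE CustomerName LIKE '%foo%';

--Sargable: a prefix-only pattern can still seek that index
SELECT CustomerId FROM dbo.Customers WHERE CustomerName LIKE 'foo%';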

In such situations, using a COLUMNSTORE index can provide an impressive performance enhancement due to its massive data compression for column values coming from a similar domain. In SQL Server 2017, batch mode execution can improve the performance further. A columnstore index also gives better index reorganize results.

I cited some of the above points from Brent Ozar's presentation.

Here is a demonstration where I prepared two identical tables with 10 million rows each, with a certain level of cardinality.

Data Preparation:

--Test table1
drop table if exists dbo.dummy

select  top 10000000 objectid1 = a.object_id, Name1 = a.name, objectid2 = b.object_id, Name2 = b.name, objectid3 = c.object_id, Name3 = c.name
into    dbo.dummy
from    sys.objects a
        cross join sys.objects b
        cross join sys.objects c
order   by a.object_id, a.name

drop index if exists ix_dummy on dbo.dummy
go

--create a nonclustered rowstore index
create index ix_dummy on dbo.dummy (objectid1, objectid2, objectid3)
go



--Test Table2
drop table if exists dbo.dummy2

select  top 10000000 objectid1 = a.object_id, Name1 = a.name, objectid2 = b.object_id, Name2 = b.name, objectid3 = c.object_id, Name3 = c.name
into    dbo.dummy2
from    sys.objects a
        cross join sys.objects b
        cross join sys.objects c
order   by a.object_id, a.name

drop index if exists ix_dummy2 on dbo.dummy2
go

--create a nonclustered columnstore index
create nonclustered columnstore index ix_dummy2 on dbo.dummy2 (objectid1, objectid2, objectid3)
go


set statistics io on
set statistics time on

1: Simple search:

The relative percentage and the logical reads say the Columnstore Index is the winner, but the elapsed time says the Rowstore Index is the winner.

--Simple search
--Run these two queries together and compare their percentage of time taken with respect to each other, logical reads, and elapsed time.
select objectid3
from dbo.dummy
where objectid1 in (5) -- look for some object_id that exists in your database

select objectid3
from dbo.dummy2
where objectid1 in (5) -- look for some object_id that exists in your database

Logical Reads and Elapsed Time: [screenshot]

Execution Plans: [screenshot]

2: Aggregation and Group By Queries: Columnstore Index Wins Big Time

--Aggregate queries
--Run these two queries together and compare their percentage of time taken with respect to each other, logical reads, and elapsed time.
select max(objectid3)
from dbo.dummy
group by objectid1, objectid2
having max(objectid3) <> min(objectid3)

select max(objectid3)
from dbo.dummy2
group by objectid1, objectid2
having max(objectid3) <> min(objectid3)

Logical Reads and Elapsed Time: [screenshot]

Execution Plans: [screenshot]

san
  • I agree that the SQL Server version is a factor to consider for using a columnstore index. One has to be on SQL Server 2012 or a later version. Also, if using a version older than SQL Server 2016, one has to consider the drop-and-recreate approach for columnstore indexes during data loads (which is usually fine in the overnight data load hours of a data warehouse); a sketch of this pattern appears after these comments. – san Jun 25 '19 at 03:18
  • Great answer, can you verify that the version of the query I wrote might in fact be able to use an index? – Tim Biegeleisen Jun 25 '19 at 03:33
  • @TimBiegeleisen Yes, as you can see in my screenshots, the query is exactly the query you wrote. For the rowstore index example (the one you were creating), it made use of the rowstore index, but it performed an index scan as it had to scan the leaf pages anyway to perform the aggregation. And when scans are necessary, depending on the amount of data and cardinality, a COLUMNSTORE index can really shine (which is what we see in the above demonstration). – san Jun 25 '19 at 03:41
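
For completeness, here is a sketch of the drop-and-recreate pattern mentioned in the comment above for pre-2016 nonclustered columnstore indexes, reusing the dbo.dummy2 table and ix_dummy2 index from the demo; the load step itself is only indicated by a comment:

--Before SQL Server 2016, a table with a nonclustered columnstore index is
--read-only, so a typical nightly load drops the index, loads, then rebuilds it
IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE name = 'ix_dummy2' AND object_id = OBJECT_ID('dbo.dummy2'))
    DROP INDEX ix_dummy2 ON dbo.dummy2;
go

-- ... bulk load or incremental load into dbo.dummy2 goes here ...

CREATE NONCLUSTERED COLUMNSTORE INDEX ix_dummy2
    ON dbo.dummy2 (objectid1, objectid2, objectid3);
go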