1

Let us have a simple table order(id: int, category: int, order_date: int) created using the following script

IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL  DROP TABLE dbo.orders

SELECT TOP 1000000
      NEWID() id, 
      ABS(CHECKSUM(NEWID())) % 100 category, 
      ABS(CHECKSUM(NEWID())) % 10000 order_date
INTO orders
FROM    sys.sysobjects
CROSS JOIN sys.all_columns

Now, I have two equivalent queries (at least I believe that they are equivalent):

-- Q1
select distinct o1.category,
       (select count(*) from orders o2 where order_date = 1 and o1.category = o2.category) 
from orders o1

-- Q2
select o1.category,
       (select count(*) from orders o2 where order_date = 1 and o1.category = o2.category) 
from (select distinct category from orders) o1

However, when I run those queries they have a significantly different characteristic. The Q2 is twice faster for my data and it is clearly caused by the fact that the query plan first find unique categories (hash match in the following query plans) before the join.

enter image description here

The difference is still there if add requested index

CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)

Moreover, the Q2 can use efficiently also the following index, whereas, the Q1 remains the same:

CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)

My question are:

  1. Are these queries equivalent?
  2. If yes, what is the obstacle for the SQL Server 2016 query optimizer to find the second query plan in the case of Q1 (I believe that the search space must be quite small in this case)?
  3. If no, could you post a counter example?

EDIT

My motivation for the question is that I would like to understand why query optimizers are so poor in rewriting even simple queries and they rely on SQL syntax so heavily. SQL language is a declarative language, therefore, why SQL query processors are driven by syntax so often even for simple queries like this?

Radim Bača
  • 10,646
  • 1
  • 19
  • 33
  • Why not just `SELECT category, COUNT(*) FROM dbo.orders WHERE order_date = 1 GROUP BY category;`? Have you dumbed down a more complicated problem? – Aaron Bertrand Jan 12 '18 at 13:51
  • Could you post the query plan using @AaronBertrand's query? That would be the most typical example for the results you want. – justiceorjustus Jan 12 '18 at 13:57
  • @justiceorjustus actually, it is not an equivalent query to the previous two. It leaves out categories when there is no `order_date = 1` – Radim Bača Jan 12 '18 at 14:01
  • @RadimBača Maybe then go for `SELECT category, COUNT(CASE WHEN order_date = 1 THEN 1 END) FROM dbo.orders GROUP BY category;` – justiceorjustus Jan 12 '18 at 14:05
  • @justiceorjustus thanks, conditional aggregation this leads to one sequential scan, which may be good or bad depending on selectivity as shown here: https://stackoverflow.com/questions/45795898/conditional-aggregation-performance Maybe I should have been more clear in my question that I would like to understand why query optimizers are so poor in rewriting even simple queries and they rely on SQL syntax so heavily. – Radim Bača Jan 12 '18 at 14:13
  • You can also use `select o1.category, (select count(*) from orders o2 where order_date = 1 and o1.category = o2.category) from orders o1 GROUP BY o1.category` – Martin Smith Jan 12 '18 at 18:59

1 Answers1

0

The queries are functionally equivalent, meaning that they should return the same data.

However, they are interpreted differently by the SQL engine. The first (SELECT DISTINCT) generates all the results and then removes the duplicates.

The second extracts the distinct values first, so the subquery is only called on the appropriate subset.

An index might make either query more efficient, but it won't fundamentally affect whether the distinct processing occurs before or after the subquery.

In this case, the results are the same. However, that is not necessarily true depending on the subquery.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • My point was, that the SQL is a declarative language. Therefore, I would expect that the query optimizer finds the same query plan in such simple case. – Radim Bača Jan 12 '18 at 13:58
  • @RadimBača . . . I don't know of any database that would do this optimization with a subquery. It depends on the contents of the subquery, and that can be hard to decipher. – Gordon Linoff Jan 12 '18 at 18:48