Let us have a simple table orders(id: uniqueidentifier, category: int, order_date: int) created using the following script:
-- 1,000,000 rows: category in [0, 99], order_date in [0, 9999]
IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL
    DROP TABLE dbo.orders;

SELECT TOP (1000000)
       NEWID()                        AS id,
       ABS(CHECKSUM(NEWID())) % 100   AS category,
       ABS(CHECKSUM(NEWID())) % 10000 AS order_date
INTO orders
FROM sys.sysobjects
CROSS JOIN sys.all_columns;
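For context, a quick sanity check of the generated data (a sketch only; the exact numbers depend on the random values) shows roughly 100 distinct categories and only a small fraction of the rows with order_date = 1:

-- sanity check of the generated data (illustrative only)
SELECT COUNT(*)                 AS total_rows,
       COUNT(DISTINCT category) AS distinct_categories,
       SUM(CASE WHEN order_date = 1 THEN 1 ELSE 0 END) AS rows_with_order_date_1
FROM orders;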
Now, I have two equivalent queries (at least I believe that they are equivalent):
-- Q1
select distinct
       o1.category,
       (select count(*)
        from orders o2
        where o2.order_date = 1
          and o1.category = o2.category) as cnt
from orders o1
-- Q2
select o1.category,
       (select count(*)
        from orders o2
        where o2.order_date = 1
          and o1.category = o2.category) as cnt
from (select distinct category from orders) o1
However, when I run these queries they behave quite differently. Q2 is about twice as fast on my data, and the reason is clearly that its query plan first finds the distinct categories (the Hash Match in the query plans below) before the join.
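For reference, a fully set-based formulation of the same count makes the "aggregate in one pass" shape explicit. This is only a sketch and is equivalent to Q1/Q2 only because category is never NULL here (with NULLs, the correlated predicate o1.category = o2.category would yield a zero count, while the grouped version would count the NULL group's rows):

-- Q3 (sketch, not part of the question): one pass over orders,
-- counting the order_date = 1 rows per category directly
select category,
       count(case when order_date = 1 then 1 end) as cnt
from orders
group by category

The question, however, is about Q1 vs. Q2.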
The difference is still there even if I add the recommended index:
CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)
Moreover, Q2 can also make efficient use of the following index, whereas Q1's plan remains the same:
CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)
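The way I read this: with (category, order_date) as the index key, each correlated count in Q2 collapses to something like the parameterised lookup below, i.e. one narrow seek per distinct category (the variable is purely illustrative):

-- one correlated count, written as a standalone lookup; with ix_orders_kat
-- this is a single seek on (category, order_date)
declare @c int = 42;  -- illustrative category value
select count(*)
from orders
where category = @c
  and order_date = 1;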
My questions are:
- Are these queries equivalent?
- If yes, what prevents the SQL Server 2016 query optimizer from finding the second query plan for Q1 (I believe the search space must be quite small in this case)?
- If no, could you post a counter example?
EDIT
My motivation for the question is that I would like to understand why query optimizers are so poor at rewriting even simple queries, and why they rely so heavily on the SQL syntax. SQL is a declarative language, so why are SQL query processors driven by the syntax so often, even for simple queries like this one?