I've got a problem querying a table which has rows with a parent/child relationship. When working up a simplified example, I realized that the stackexchange schema is very similar.
So imagine I'm querying the stackoverflow posts table via stackexchange data explorer. I am trying to get a subset of all posts and their associated answers.
See https://data.stackexchange.com/stackoverflow/query/121981/a-subset-of-questions-and-associated-answers for sample queries.
The subset of posts is defined in a view, which has a fairly complex and expensive query plan. In the below example it has been simplified to simply select the top two rows.
The first way, using a union:
with ExpensiveView as (select top 2 ID from Posts order by ID)
select Posts.*
from ExpensiveView
left outer join Posts
ON ExpensiveView.Id = Posts.Id
union all
select Posts.*
from ExpensiveView
left outer join Posts
ON ExpensiveView.Id = Posts.ParentId
I'd quite like to avoid this way, since ExpensiveView
is evaluated twice. Obviously not a problem for the simplified version above, but causes issues for one more complex.
The second way, using a single select with a conditional join clause:
with ExpensiveView as (select top 2 ID from Posts order by ID)
select Posts.*
from ExpensiveView
left outer join Posts
ON ExpensiveView.Id = Posts.Id or ExpensiveView.Id = Posts.ParentId
This avoids ExpensiveView
being evaluated twice, but causes a ridiculously large clustered index scan. It seems to be scanning the entire index per ID in ExpensiveView
(so 2 * 14977623 = ~30 million rows). This is very slow.
Two questions
Why does the conditional join in the second query result in such a large index scan?
Is there any way I can get the results I am looking for without ExpensiveView
being evaluated more than once?