3

I've got a problem querying a table which has rows with a parent/child relationship. When working up a simplified example, I realized that the stackexchange schema is very similar.

So imagine I'm querying the stackoverflow posts table via stackexchange data explorer. I am trying to get a subset of all posts and their associated answers.

See https://data.stackexchange.com/stackoverflow/query/121981/a-subset-of-questions-and-associated-answers for sample queries.

The subset of posts is defined in a view, which has a fairly complex and expensive query plan. In the below example it has been simplified to simply select the top two rows.

The first way, using a union:

with ExpensiveView as (select top 2 ID from Posts order by ID)

select Posts.*
from ExpensiveView
left outer join Posts
  ON ExpensiveView.Id = Posts.Id 
  
union all

select Posts.*
from ExpensiveView
left outer join Posts
  ON ExpensiveView.Id = Posts.ParentId

I'd quite like to avoid this way, since ExpensiveView is evaluated twice. Obviously not a problem for the simplified version above, but causes issues for one more complex.

The second way, using a single select with a conditional join clause:

with ExpensiveView as (select top 2 ID from Posts order by ID)

select Posts.*
from ExpensiveView
left outer join Posts
  ON ExpensiveView.Id = Posts.Id or ExpensiveView.Id = Posts.ParentId

This avoids ExpensiveView being evaluated twice, but causes a ridiculously large clustered index scan. It seems to be scanning the entire index per ID in ExpensiveView (so 2 * 14977623 = ~30 million rows). This is very slow.

Two questions

Why does the conditional join in the second query result in such a large index scan?

Is there any way I can get the results I am looking for without ExpensiveView being evaluated more than once?

Community
  • 1
  • 1
John
  • 1,502
  • 2
  • 13
  • 40
  • 4
    Regarding your `OR` question see [Is having an 'OR' in an INNER JOIN condition a bad idea?](http://stackoverflow.com/questions/5901791/is-having-an-or-in-an-inner-join-condition-a-bad-idea/5901901#5901901). Any reason why you can't materialize `ExpensiveView` into a `#temp` table? – Martin Smith Jun 26 '13 at 14:24
  • The linked answer says that when using an or in a join clause, then the join cannot be merge or hash and as such will be a loop join - Implying that the loop join is the cause of the large table scan. But if I remove one of the join clauses then the loop join remains, it is just doesn't do the large scan. – John Jun 26 '13 at 15:27
  • Regarding using a temp table, I'm a little concerned about the overheads involved. This is a query which would be running reasonably frequently (up to a few times per second). – John Jun 26 '13 at 15:32
  • 1
    Also `ExpensiveView.Id IN (Posts.Id, Posts.ParentId)` cannot be satisfied by an index seek on `Posts`. In principle it could do two index seeks on `Posts.Id` and then on `Posts.ParentId` then remove any duplicate rows that matched both seeks though I'm not sure SQL Server will ever give that plan in practice. (Does occur to me you could probably simulate something similar with `OUTER APPLY` though) – Martin Smith Jun 26 '13 at 15:33
  • 1
    How does this perform against your data? `with ExpensiveView as (select top 2 ID from Posts order by ID) select Posts.* from ExpensiveView OUTER APPLY (SELECT * FROM Posts WHERE Id = ExpensiveView.Id UNION SELECT * FROM Posts WHERE ParentId = ExpensiveView.Id) Posts` – Martin Smith Jun 26 '13 at 15:37
  • That does perform slightly better, thanks. Your previous comment gave me the inspiration to tweak it a little bit. The union within the outer apply isn't needed - a single select with a 'or' in the where clause generates a plan with two separate index seeks. `with ExpensiveView as (select top 2 ID from Posts order by ID) select Posts.* from ExpensiveView OUTER APPLY (SELECT * FROM Posts WHERE Id = ExpensiveView.Id or ParentId = ExpensiveView.Id) Posts`. Many thanks for the help. – John Jun 26 '13 at 15:47

1 Answers1

0

try this

with
ExpensiveView as (select top 2 ID from Posts order by ID),
CTE_Posts as (
    select *, NP.Id as New_Post_ID
    from Posts as P
        outer apply (select P.Id union all select P.ParentId) as NP
)
select
    P.*
from ExpensiveView as E
    left outer join CTE_Posts as P on P.New_Post_ID = E.ID
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197