32

I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.

Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often third-party BI apps will only play nicely with views, so this is crucial.

I am convinced there must be a simple query-plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.

It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time-consuming ways I could coax the optimiser to look at tables in the right order, but even this offers no guarantees. I'm not asking for the ideal 2-second execute time here, just the speed that temp tabling offers me within the flexibility of a view.

I've never posted on here before, but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem. Now I would just like the appropriate genius to step forward and say the special hint is X...

Adamantish
  • Do you have an example of a long running query? – podiluska Sep 12 '13 at 13:36
  • I can paste it in but please believe me when I say it's not relevant. The query was of similar complexity when it was taking only 8 seconds to run and the only changes I made were in a detail of aggregation and it went to 120 seconds. It still runs quick if I temp table the subquery. I believe that if I paste the code here you will fix the deckchairs on the Titanic, critique details, possibly find a major rewrite that coaxes it into performing properly but I'm talking about a principle in the abstract here - namely that I would like to spool the results of subqueries. Simple. – Adamantish Sep 12 '13 at 13:50
  • Are you asking two questions here: "Why are temp tables faster than subqueries?" and "Why are temp tables faster than CTEs?" – mbeckish Sep 12 '13 at 14:17
  • The following article is worth a read regarding spools: http://sqlblog.com/blogs/rob_farley/archive/2013/06/11/spooling-in-sql-execution-plans.aspx – MicSim Sep 12 '13 at 14:51
  • Not quite. In my experience CTEs behave the same as subqueries but with some extra recursion options. It just so happens I tried my current problem with a CTE instead, just in case. – Adamantish Sep 12 '13 at 14:56
  • If you really want the Optimizer to always spool a specific subquery, then just encapsulate it as a non-inline Table-Valued Function. It almost always pre-materializes these. – RBarryYoung Sep 12 '13 at 16:15
  • @RBarryYoung - One downside of that though is that it will also assume 1 row will be emitted. So it may mess up cardinality estimates for the rest of the plan. – Martin Smith Sep 12 '13 at 16:22
  • @Martin Right. It requires external definition also, which is kludgy when you just want to edit a query. I like your `TOP` trick better, I hadn't thought of that. – RBarryYoung Sep 12 '13 at 16:25
  • @RBarryYoung - But that adds an unnecessary sort and relies on unguaranteed optimiser behaviour, so it is also a kludge :-) – Martin Smith Sep 12 '13 at 16:28
  • @MartinSmith Does it still work if you nullify the `ORDER BY`? (and technically, I think that might be a "*hack*" rather than a "*kludge*", :-) ). – RBarryYoung Sep 12 '13 at 16:38
  • That unnecessary sort can be put to a little use though. In my case I sorted on the column that would be joined to. Not as good as the optimiser *knowing* it was sorted that way but still a help, right? I will try the TVF approach in future if all else fails. (I think the connoisseur would say "hack".) – Adamantish Sep 12 '13 at 17:16
  • It's really a particularly pretty hack because the ORDER BY clause is only allowable (deemed useful by MS) in conjunction with TOP, which is the part we must explicitly render useless so as to get the use out of the supposedly useless ORDER BY. – Adamantish Sep 12 '13 at 19:14
  • Since it has been over 8 years since this issue was raised, I think a better answer now is to fire the MS product manager(s) who chose not to implement the "Materialize CTEs" request on MS Connect circa 2014 and hire someone who understands that the optimizer needs a dad sometimes, along with the advantages of views over SPs (esp. multiway-view nesting, predicate flexibility). – crokusek Jul 24 '22 at 01:14

4 Answers

13

There are a few possible explanations as to why you see this behavior. Some common ones are:

  1. The subquery or CTE may be being repeatedly re-evaluated.
  2. Materialising partial results into a #temp table may force a more optimal join order for that part of the plan by removing some possible options from the equation.
  3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.

The most reliable method is simply to use a #temp table and materialize it yourself.
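As a minimal sketch of that pattern (the table and column names here are hypothetical), you materialise the intermediate result yourself and can even index it; #temp tables also get real statistics, unlike spools:

```sql
-- Materialise the troublesome subquery yourself...
SELECT o.CustomerId, SUM(o.Amount) AS Total
INTO   #CustomerTotals
FROM   dbo.Orders o
GROUP  BY o.CustomerId;

-- ...optionally index it to help the join that follows...
CREATE CLUSTERED INDEX IX_CustomerTotals ON #CustomerTotals (CustomerId);

-- ...then join against the materialised result.
SELECT c.Name, t.Total
FROM   dbo.Customers c
JOIN   #CustomerTotals t ON t.CustomerId = c.CustomerId;
```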

Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. Using TOP (large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
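A sketch of what that trick looks like, with hypothetical tables (the huge TOP is logically a no-op, but it legitimises the ORDER BY, which in turn tends to make the optimiser materialise the derived table instead of re-evaluating it):

```sql
SELECT c.Name, t.Total
FROM   dbo.Customers c
JOIN  (SELECT TOP (2147483647)      -- effectively "all rows"
              o.CustomerId, SUM(o.Amount) AS Total
       FROM   dbo.Orders o
       GROUP  BY o.CustomerId
       ORDER  BY o.CustomerId       -- only allowed because of TOP; encourages a spool
      ) t ON t.CustomerId = c.CustomerId;
```

This relies on observed rather than guaranteed optimiser behaviour, so treat it as a workaround, not a contract.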

Even if that works, however, there are no statistics on the spool.

For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.

gliljas
Martin Smith
  • Yes! That's the charm (the forced intermediate materialisation). Down to 4 seconds. Sure seems like a hacky way to get a very commonly useful thing done but maybe MS will make it a hint or setting later. – Adamantish Sep 12 '13 at 15:01
  • @Adamantish - Well they haven't closed it as "won't fix" yet which is encouraging from that POV. – Martin Smith Sep 12 '13 at 15:06
  • And your point about no stats on the spool is worth bearing in mind for when I come up against reasons 2 and 3. – Adamantish Sep 12 '13 at 15:14
  • I will probably rarely ever use a temp table again except where I need to reuse it. – Adamantish Sep 12 '13 at 15:22
  • For a large query that uses `with a as (...), b as (...)` I found that using `top 1000000 ... order by` did not seem to give the same speedup as just converting the subqueries to temp tables. This on MSSQL 2008 R2. – Ed Avis Oct 31 '14 at 12:39
  • @EdAvis yes, creating temp tables explicitly is much more reliable and also has other potential benefits in terms of allowing you to add needed indexes and better cardinality estimates for the rest of the plan. Messing around with `top` to try and get this would very much be a plan b for me. E.g. If for some reason the query absolutely had to be in a view so that explicit materialisation was not an option. – Martin Smith Oct 31 '14 at 12:57
  • To me, Views are the fundamental particle of well modularised set-based programming. They hang onto your indexes and give the optimiser its best shot at putting together a good plan out of a bunch of nested objects. Still, we're talking about how to frustrate the optimiser here so the point may be moot. – Adamantish Nov 18 '14 at 13:30
  • I've seen cases with user-defined functions where even the `TOP` trick doesn't work. Specifically an expensive user-defined function in a joined subquery _with one row_ was obviously being re-evaluated over and over despite the fact that it only needed to be evaluated once. Storing the subquery in a temp table was the only thing that fixed it and made it run in a reasonable amount of time. – lehiester Mar 18 '20 at 19:49
5

I do not believe there is a query hint that instructs the engine to spool each subquery in turn.

There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified, and which could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
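For example (hypothetical tables; with the hint, the joins are performed in the order they are written rather than the order the optimizer prefers):

```sql
SELECT f.Id, b.Value
FROM   dbo.Foo f                      -- evaluated first...
JOIN   dbo.Bar b ON b.FooId = f.Id    -- ...then joined to Bar
WHERE  f.Deleted = 0
OPTION (FORCE ORDER);                 -- preserve the written join order
```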

Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but no such hint exists.

Dan Bellandi
  • That's the answer I feared. Is there some practical reason that the engine might have been built this way? My best guess is that it's a relic from the days when disk space really mattered. Maybe it does still really matter that views, by nature, not only be guaranteed to make no changes to the db but also leave a light footprint on temp processing disk space. It's just particularly annoying when you've already made the view central to other processes. – Adamantish Sep 12 '13 at 14:57
  • Force order did it for me! Thank you – CodeMonkey Oct 18 '13 at 18:09
3

Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:

CREATE FUNCTION SalesByStore (@storeid varchar(30))
   RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,  -- PK on the result table
                     qty   smallint    NOT NULL) AS
BEGIN
   -- Materialise the sales for one store into the table variable
   INSERT @t (title, qty)
      SELECT t.title, s.qty
      FROM   sales s
      JOIN   titles t ON t.title_id = s.title_id
      WHERE  s.stor_id = @storeid
   RETURN
END
GO

CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
erdomke
  • "In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer." Can you clarify this? The linked article doesn't appear to indicate it is possible for user-defined functions to make PKs or indexes. – Ed Avis Oct 31 '14 at 12:06
  • @EdAvis the result table of multi statement TVFs is just a table variable so the methods here apply http://stackoverflow.com/a/17385085/73226 – Martin Smith Oct 31 '14 at 13:15
  • Exactly. In my example, you can see a primary key is specified for the title column of the return table. – erdomke Oct 31 '14 at 15:12
  • Sorry, my mistake, I was only looking for 'create' statements :-( – Ed Avis Oct 31 '14 at 17:55
  • Not a problem; I have been there. – erdomke Nov 01 '14 at 18:08
  • A good suggestion that will be the answer in many cases. Just wanted to reiterate a warning @MartinSmith gave about this approach in a comment above: "One downside of that though is that it will also assume 1 row will be emitted. So it may mess up cardinality estimates for the rest of the plan." – Adamantish Nov 18 '14 at 13:25
0

Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because there was an index it could use (IDX_CreatedOn on TableFoo).

SELECT bar.*
FROM
    (SELECT * FROM TableFoo WHERE Deleted = 1) foo
    JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())

I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.

SELECT bar.*
FROM
    (SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
    JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())

Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.


A more hacky solution (and not really recommended) is to force the subquery to be executed first by limiting the results using TOP; however, this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately, TOP 100 PERCENT can't be used for this purpose, since SQL Server just ignores it.
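A sketch of that workaround applied to the query above (the TOP limit is arbitrary and must comfortably exceed any row count the subquery can return):

```sql
SELECT bar.*
FROM
    (SELECT TOP 1000000 *          -- arbitrary large limit forces early execution
     FROM   TableFoo
     WHERE  Deleted = 1) foo
    JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
```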

jahu