9

I'm working to update a stored procedure that currently selects up to n rows; if the number of rows returned equals n, it does a SELECT COUNT without the limit, and then returns the original selection along with the total row count.

Kinda like:

SELECT TOP (@rowsToReturn)
    A.data1,
    A.data2
FROM
    mytable A

SET @maxRows = @@ROWCOUNT
IF @rowsToReturn = @@ROWCOUNT
BEGIN
  SET @maxRows = (SELECT COUNT(1) FROM mytable)
END    

I want to reduce this to a single select statement. Based on this question, COUNT(*) OVER() allows this, but it puts the total on every single row instead of in an output parameter. Maybe there is something like MySQL's FOUND_ROWS(), such as a @@TOTALROWCOUNT or the like.
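
For illustration, the windowed-count version would be something like the following (the TotalRows alias is just for the example):

SELECT TOP (@rowsToReturn)
    A.data1,
    A.data2,
    COUNT(*) OVER () AS TotalRows -- same full-set total repeated on every row
FROM
    mytable A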

As a side note, since the actual select has an order by, the database will need to traverse the entire set anyway (to make sure that it gets the correct first n ordered records), so it should already have this count somewhere.

Lawtonfogle
  • It's probably more efficient to include the data at the end of every single row, because the alternative is to execute the query logic twice. – Joel Coehoorn Nov 21 '14 at 20:53
  • There's no straightforward way that doesn't entail either running the query twice or materialising into a temp table then selecting the desired rows. If the plan has a blocking sort you could in theory collect the actual execution plan and parse out the actual number of rows entering the sort, but I don't seriously recommend that. – Martin Smith Nov 21 '14 at 20:53
  • why don't you just select from the sys table that tracks the total number of rows in the table? – Zane Nov 21 '14 at 21:05
  • @Zane I left out the where clause and all the joins for the sake of simplicity. If this was merely counting the rows of a table it wouldn't have been a problem. – Lawtonfogle Nov 21 '14 at 21:09
  • @JoelCoehoorn I was hoping for an alternative that wasn't rerunning it. – Lawtonfogle Nov 21 '14 at 21:10
  • Don't you just return the whole set and add the top portion to the display layer? – Zane Nov 21 '14 at 21:12
  • `@@ROWCOUNT` gets reset after every statement, so unless `@RowsToReturn` is 1, that second check is never going to work the way you think... – Aaron Bertrand Nov 22 '14 at 00:31
  • Really a sort algorithm by nature would have a count? Do you know what a bubble sort is? If I asked 30 students to line up by height why would they have any reason to know the count? Why is this getting up votes rather than down votes? – paparazzo Nov 24 '14 at 20:17
  • @Blam Because to sort n items, you have to look at n items. Thus, you will have been able to count them. Does this mean you actually did count them? No. But you could have. Compare this to just taking the first n items: you don't know how many are left; you never had to see them. In short, I am not saying it must have a count, I am saying it can have a count at no extra cost, and asking if in this case it actually does and, if so, whether there is a way to access it. – Lawtonfogle Nov 24 '14 at 22:03
  • Really, no extra cost. So you don't know what a bubble sort is. Really, you think a sort touches each row exactly once? Please post your algorithm for sort-and-count with no extra cost for the count. – paparazzo Nov 25 '14 at 13:35
  • A bubble sort of n items, even in the best case, would still have to go through all the items at least once. It would only need to have a single int that increments for every item it looks at on the first pass. `i++;` is no cost in any application where you would be using SQL Server. – Lawtonfogle Nov 25 '14 at 13:50
  • @Lawtonfogle There are these things called indexes and joins. The query optimizer will do things early when it can - including sorts. It does not blindly run the WHERE and do the sort last. – paparazzo Nov 25 '14 at 17:57

4 Answers

7

As @MartinSmith mentioned in a comment on this question, there is no direct (i.e. pure T-SQL) way of getting the total number of rows that would be returned while at the same time limiting it. In the past I have done the method of:

  • dump the query to a temp table to grab @@ROWCOUNT (the total set)
  • use ROW_NUMBER() AS [ResultID] on the ordered results of the main query
  • SELECT TOP (n) FROM #Temp ORDER BY [ResultID] or something similar
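
A minimal sketch of that approach, reusing the names from the question and assuming data1 is the sort key:

DECLARE @RowsToReturn INT = 10,
        @TotalRows    INT;

SELECT
    A.data1,
    A.data2,
    ROW_NUMBER() OVER (ORDER BY A.data1) AS [ResultID]
INTO #Temp
FROM mytable A;

SET @TotalRows = @@ROWCOUNT; -- row count of the full result set

SELECT TOP (@RowsToReturn) data1, data2
FROM #Temp
ORDER BY [ResultID];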

Of course, the downside here is that you have the disk I/O cost of getting those records into the temp table. Put [tempdb] on SSD? :)


I have also tried the "run COUNT(*) with the same rest of the query first, then run the regular SELECT" method (as advocated by @Blam), and it is not a "free" re-run of the query:

  • It is a full re-run in many cases. The issue is that when doing COUNT(*) (hence not returning any fields), the optimizer only needs to worry about indexes in terms of the JOIN, WHERE, GROUP BY, ORDER BY clauses. But when you want some actual data back, that could change the execution plan quite a bit, especially if the indexes used to get the COUNT(*) are not "covering" for the fields in the SELECT list.
  • The other issue is that even if the indexes are all the same and hence all of the data pages are still in cache, that just saves you from the physical reads. But you still have the logical reads.

I'm not saying this method doesn't work, but I think the method in the Question that only does the COUNT(*) conditionally is far less stressful on the system.


The method advocated by @Gordon is actually functionally very similar to the temp table method I described above: it dumps the full result set to [tempdb] (the INSERTED table is in [tempdb]) to get the full @@ROWCOUNT and then it gets a subset. On the downside, the INSTEAD OF TRIGGER method is:

  • a lot more work to set up (as in 10x - 20x more): you need a real table to represent each distinct result set, you need a trigger, the trigger needs to either be built dynamically, or get the number of rows to return from some config table, or I suppose it could get it from CONTEXT_INFO() or a temp table. Still, the whole process is quite a few steps and convoluted.

  • very inefficient: first it does the same amount of work dumping the full result set to a table (i.e. into the INSERTED table--which lives in [tempdb]), but then it does an additional step of selecting the desired subset of records (not really a problem as this should still be in the buffer pool) to go back into the real table. What's worse is that the second step is actually double I/O, as the operation is also recorded in the transaction log for the database where that real table exists. But wait, there's more: what about the next run of the query? You need to clear out this real table. Whether via DELETE or TRUNCATE TABLE, it is another operation that shows up in the transaction log (how much gets logged depends on which of those two operations is used), plus it is additional time spent on the additional operation. AND, let's not forget about the step that selects the subset out of INSERTED into the real table: it doesn't have the opportunity to use an index since you can't index the INSERTED and DELETED tables. Not that you always would want to add an index to the temp table, but sometimes it helps (depending on the situation) and you at least have that choice.

  • overly complicated: what happens when two processes need to run the query at the same time? If they are sharing the same real table to dump into and then select out of for the final output, then there needs to be another column added to distinguish between the SPIDs. It could be @@SPID. Or it could be a GUID created before the initial INSERT into the real table is called (so that it can be passed to the INSTEAD OF trigger via CONTEXT_INFO() or a temp table). Whatever the value is, it would then be used to do the DELETE operation once the final output has been selected. And if not obvious, this part influences a performance issue brought up in the prior bullet: TRUNCATE TABLE cannot be used as it clears the entire table, leaving DELETE FROM dbo.RealTable WHERE ProcessID = @WhateverID; as the only option.

    Now, to be fair, it is possible to do the final SELECT from within the trigger itself. This would reduce some of the inefficiency as the data never makes it into the real table and then also never needs to be deleted. It also reduces the over-complication as there should be no need to separate the data by SPID. However, this is a very time-limited solution as the ability to return results from within a trigger is going bye-bye in the next release of SQL Server, so sayeth the MSDN page for the "disallow results from triggers" Server Configuration Option:

    This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible. We recommend that you set this value to 1.


The only actual way to do:

  • the query one time
  • get a subset of rows
  • and still get the total row count of the full result set

is to use .NET. If the procs are being called from app code, please see "EDIT 2" at the bottom. If you want to be able to randomly run various stored procedures via ad hoc queries, then it would have to be a SQLCLR stored procedure so that it could be generic and work for any query, as stored procedures can return dynamic result sets and functions cannot. The proc would need at least 3 parameters:

  • @QueryToExec NVARCHAR(MAX)
  • @RowsToReturn INT
  • @TotalRows INT OUTPUT
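
Calling such a proc from T-SQL would then look something like the following (the proc name here is hypothetical):

DECLARE @TotalRows INT;

EXEC dbo.TopNWithTotal -- hypothetical name for the SQLCLR proc
     @QueryToExec  = N'SELECT A.data1, A.data2 FROM mytable A ORDER BY A.data1;',
     @RowsToReturn = 10,
     @TotalRows    = @TotalRows OUTPUT;

SELECT @TotalRows AS [TotalRows]; -- full-set count, even though only 10 rows were streamed back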

The idea is to use "Context Connection = true;" to make use of the internal / in-process connection. You then do these basic steps:

  1. call ExecuteDataReader()
  2. before you read any rows, do a GetSchemaTable()
  3. from the SchemaTable you get the result set field names and datatypes
  4. from the result set structure you construct a SqlDataRecord
  5. with that SqlDataRecord you call SqlContext.Pipe.SendResultsStart(_DataRecord)
  6. now you start calling Reader.Read()
  7. for each row you call:
    1. Reader.GetValues()
    2. DataRecord.SetValues()
    3. SqlContext.Pipe.SendResultRow(_DataRecord)
    4. RowCounter++
  8. Rather than doing the typical "while (Reader.Read())", you instead include the @RowsToReturn param: while(Reader.Read() && RowCounter < RowsToReturn.Value)
  9. After that while loop, call SqlContext.Pipe.SendResultsEnd() to close the result set (the one that you are sending, not the one you are reading)
  10. then do a second while loop that cycles through the rest of the result, but never gets any of the fields: while (Reader.Read()) { RowCounter++; }
  11. then just set TotalRows = RowCounter; which will pass back the number of rows for the full result set, even though you only returned the top n rows of it :)

Not sure how this performs against the temp table method, the dual call method, or even @M.Ali's method (which I have also tried and kinda like, but the question was specific to not sending the value as a column), but it should be fine and does accomplish the task as requested.

EDIT:
Even better! Another option (a variation on the above C# suggestion) is to use the @@ROWCOUNT from the T-SQL stored procedure, sent as an OUTPUT parameter, rather than cycling through the rest of the rows in the SqlDataReader. So the stored procedure would be similar to:

CREATE PROCEDURE SchemaName.ProcName
(
   @Param1 INT,
   @Param2 VARCHAR(05),
   @RowCount INT = -1 OUTPUT -- default so it doesn't have to be passed in
)
AS
SET NOCOUNT ON;

{any ol' query}

SET @RowCount = @@ROWCOUNT;
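
For a quick check from T-SQL (before wiring up the app code), the OUTPUT param can be grabbed like so:

DECLARE @rc INT;

EXEC SchemaName.ProcName
     @Param1   = 1,
     @Param2   = 'abc',
     @RowCount = @rc OUTPUT;

SELECT @rc AS [RowCount];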

Then, in the app code, create a new SqlParameter, Direction = Output, for "@RowCount". The numbered steps above stay the same, except the last two (10 and 11), which change to:

  1. Instead of the 2nd while loop, just call Reader.Close()
  2. Instead of using the RowCounter variable, set TotalRows = (int)RowCountOutputParam.Value;

I have tried this and it does work. But so far I have not had time to test the performance against the other methods.

EDIT 2:
If the T-SQL stored procs are being called from the app layer (i.e. no need for ad hoc execution) then this is actually a much simpler variation of the above C# methods. In this case you don't need to worry about the SqlDataRecord or the SqlContext.Pipe methods. Assuming you already have a SqlDataReader set up to pull back the results, you just need to:

  1. Make sure the T-SQL stored proc has a @RowCount INT = -1 OUTPUT parameter
  2. Make sure to SET @RowCount = @@ROWCOUNT; immediately after the query
  3. Register the OUTPUT param as a SqlParameter having Direction = Output
  4. Use a loop similar to: while(Reader.Read() && RowCounter < RowsToReturn) so that you can stop retrieving results once you have pulled back the desired amount.
  5. Remember to not limit the result in the stored proc (i.e. no TOP (n))

At that point, just as mentioned in the first "EDIT" above, close the SqlDataReader and grab the .Value of the OUTPUT param :).

Solomon Rutzky
  • @srutsky . . . I don't think an answer that starts "I think there is an arcane way to do what you want" is really *advocating* anything. Could you change that word in your answer? – Gordon Linoff Nov 24 '14 at 21:36
2

How about this....

DECLARE @N INT = 10

;WITH CTE AS
 (
  SELECT 
    A.data1,
    A.data2
  FROM  mytable A 
 )
SELECT TOP (@N) * , (SELECT COUNT(*) FROM CTE) Total_Rows
FROM CTE 

The last column will be populated with the total number of rows the query would have returned without the TOP clause.

The issue with your requirement is that you are expecting a SINGLE select statement to return a table and also a scalar value, which is not possible.

A single select statement will return a table or a scalar value. Or you can have two separate selects, one returning a scalar value and the other returning a table. The choice is yours :)

M.Ali
1

Just because you think T-SQL should have a row count because of a sort does not mean it does. And if it does, it is not currently sharing it with the outside world.

What you are missing is that this is very efficient:

select count(*)
from ...
where ...

select top x ...
from ...
where ...
order by ...

With the count(*) first, unless the query is just plain ugly, those indexes should still be in memory for the second query.

It has to perform a count to sort based on what?
Did you actually evaluate any query plans?
If T-SQL gets a count for free whenever it has to perform a sort, then explain the following.
Why is the count(*) 100% of the cost when the second query supposedly had to do a count anyway?
Just where in that second query plan is there a free opportunity to count?
Why are those query plans so different if they both need to count?

[screenshot of the two query plans and their relative costs]

paparazzo
  • I plea ugly to the second degree. As a side note, at current we only rerun the query if there is equal to or more than the max records returned. Wouldn't running it twice every time be worse? – Lawtonfogle Nov 21 '14 at 21:29
  • If the count is less than or equal to the max, how ugly can it be? You don't have a count problem, you have a query problem, if it is so ugly it is not in memory. Really, you know for a fact the query is not in memory? Have you tested? This sure smells like premature optimization to me. I do exactly this with some queries that hit tables with over 100 million rows. – paparazzo Nov 21 '14 at 21:32
  • @Lawtonfogle and Blam: there are some inaccuracies here. SQL Server doesn't cache the result set, just the data pages and execution plan that it used. The two plans are different because the queries are themselves different. And doing a `COUNT(*)` first and then the query is sometimes efficient, but sometimes not. It all depends on the query and indexes, etc. Change it up and you can see a big difference. However, Lawtonfogle _is_ correct. SQL Server has to have the entire result set prior to ordering (it's a 2nd pass operation). The total row count before TOP is applied _is_ in the XML plan. – Solomon Rutzky Nov 26 '14 at 16:07
  • @srutzky I did not say it cached the results - I said the index is still in memory. And it clearly does not have to have a count, as is demonstrated by the query plan. And reflected by cost relative to batch. The count is more expensive than the top - if top required a count it would be more expensive. Batch cost is not dependent on order. Buy a vowel - if I am sorting by color then age and get to the top before getting to the second color I stop, and T-SQL is also clearly smart enough to stop - see that index seek EARLY in the plan and a top cost of 1%. – paparazzo Nov 26 '14 at 16:25
  • @srutzky Really, "SQL Server has to have the entire result set prior to ordering (it's a 2nd pass operation)"? Where is the evidence that it is a second pass operation? As to "has to have the entire result set prior to ordering", let me give you a simple example that proves that wrong - a table with an identity column as pk and a second column as a varchar name. Select top 1 name from table order by iden. It does an index seek, gets one row, and stops. Why in the world would it need to evaluate every row? – paparazzo Nov 26 '14 at 16:38
  • Well, technically you said that the "query" is in memory (comment #2 above). But regardless, please do me a favor. Run the queries you are showing again, and this time hover over the "Nested Loop (Inner Join)" of the 2nd query and look for "Actual Rows". Does it show the same value as the 1st query doing the `COUNT(*)` reports? – Solomon Rutzky Nov 26 '14 at 16:40
  • @srutzky Answer trumps "With the count(*) unless the query is just plain ugly those indexes should be in memory." There is no actual and the estimated is different. Run your own analysis. – paparazzo Nov 26 '14 at 16:52
  • The "Select top 1 name from table order by iden" example is over-simplified as no ordering is needed. The datapages are already in that physical order and it just needs to count how many records to grab based on `TOP (1)` or `TOP(5)`. But `ORDER BY`, unlike `GROUP BY` and `WHERE`, can refer to column aliases and ordinal positions. `ORDER BY` only works on the total set of a UNION and not the individual queries. Yes I have done my own testing, on non-ugly queries. What does ugly have to do with it? Whatever data pages are read stay unless forced out due to needing the space for other queries. – Solomon Rutzky Nov 26 '14 at 17:00
  • @srutzky So simplified - it proves your assertion is false. In the less than simple example I posted the TSQL also does not count to top with order by. I am not going to play word games with you on ugly. Enough - lets just leave it as you don't agree. – paparazzo Nov 26 '14 at 17:31
  • Well, I did just find a case where the ActualRows matching was not in the plan. So it is often, but not always, there. I guess it is just a matter of what short-cuts, if any, can be taken. – Solomon Rutzky Nov 26 '14 at 17:35
  • @srutzky Are you now inferring that use of an alias or ordinal position would affect a query plan? The query optimizer is based on the query, not what happens to be in memory. UNION - yes, that is different. – paparazzo Nov 26 '14 at 17:42
0

I think there is an arcane way to do what you want. It involves triggers and non-temporary tables. And, I should mention, although I have implemented each piece (for different purposes), I have never put them together for this purpose.

The idea starts with this Stack Overflow question. According to this source, @@ROWCOUNT counts the number of attempted inserts, even when they don't really happen. Now, I must admit that a perusal of available documentation doesn't seem to touch on this topic, so this may or may not be "correct" behavior. This method is relying on this "problem".

So, you could do what you want by:

  • Creating a new table for the output -- but not a table variable or a temporary table.
  • Creating an "instead of" trigger that prevents more than @maxRows from going into the table.
  • Selecting the query results into the table.
  • Reading @@ROWCOUNT after the select.

Note that you can create the table and trigger using dynamic SQL. You could also create it once, and have the trigger read the @maxRows value from some sort of parameter table. As mentioned before, this needs to be a real table that supports triggers.
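
To make the shape of this concrete, here is a rough sketch (untested; dbo.QueryResults and dbo.ResultConfig are illustrative names, and whether @@ROWCOUNT behaves as described is the unverified part mentioned above):

CREATE TABLE dbo.QueryResults (data1 INT, data2 INT);
GO

CREATE TRIGGER dbo.CapQueryResults
ON dbo.QueryResults
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Hypothetical parameter table holding the row limit.
    DECLARE @maxRows INT = (SELECT MaxRows FROM dbo.ResultConfig);

    -- Let only @maxRows of the attempted rows through
    -- (no ORDER BY here, so which rows survive is arbitrary).
    INSERT INTO dbo.QueryResults (data1, data2)
    SELECT TOP (@maxRows) data1, data2
    FROM INSERTED;
END;
GO

DECLARE @totalRows INT;

INSERT INTO dbo.QueryResults (data1, data2)
SELECT A.data1, A.data2
FROM mytable A;

-- The premise (per the linked question) is that this reports the
-- attempted rows, i.e. the full result set count, not just @maxRows.
SET @totalRows = @@ROWCOUNT;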

Gordon Linoff
  • Sorry, meant to comment earlier but got distracted. Basically, this solution offers no benefit over adding a mere few lines of code to the existing proc to dump the full results to a temp table, grab @@ROWCOUNT, then select the subset from the temp table. Instead, it's highly complex (more so than is stated) and will perform worse due to additional I/O. It's purely academic and shouldn't be used on a live system. Since people don't always read comments and/or understand the implications of the proposal, I was indicating that this is a lose-lose situation and should not be attempted. – Solomon Rutzky Nov 24 '14 at 06:10
  • @srutzky . . . Not true at all. This solution would not write the additional data to the table, only count the additional rows. That could be a significant difference in performance. – Gordon Linoff Nov 24 '14 at 12:36
  • Gordon, I did not say that your solution would write the data to the "real" table. But it does go somewhere. Think in terms of what SQL Server is doing, not in terms of net effect. You should try actually testing this idea. I have and it performed as expected: much worse than the temp table (6x - 7x the logical reads). And that doesn't even account for the cleanup step required to get your solution to work for multiple users. Maybe you could start by explaining exactly how one would "prevent more than @maxRows from going into the table". – Solomon Rutzky Nov 24 '14 at 16:31
  • @srutzky . . . I cannot argue with your claim to having tested it (I don't even think this answer comes across as "this is a good idea", just "this seems to be possible"). But an `instead of` trigger would not be writing to the log, would not be updating indexes, and would not be allocating new pages for the data that is not inserted into the table. That seems like a savings. – Gordon Linoff Nov 24 '14 at 21:29
  • re: "I cannot argue with your claim to having tested it": _this_ is the main problem. You haven't tested it. Else you would know that it's very inefficient and overly complicated. You shouldn't be saying it "seems like a savings". You should be able to say that it _is_ a savings because you've seen it. The fact is, this is a half-baked idea and you don't know it because a) you don't believe me, and b) you haven't tried it yourself to know otherwise. Neither I nor SQL Server care what you, or anyone else, thinks will happen; the code does what it does. Get this working and then we can discuss. – Solomon Rutzky Nov 26 '14 at 15:28