
I have a SELECT that can return hundreds of rows from a table (the table can be ~50000 rows). My app needs to know the number of rows the query returns (that count means something important to it), but it actually uses only the top 5 of those hundreds of rows. What I want to do is limit the SELECT query to return only 5 rows, but also tell my app how many rows it would have returned (the hundreds). This is the original query:

SELECT id, a, b, c FROM table WHERE a < 2

Here is what I came up with (a CTE), but I don't feel comfortable with the total row count being repeated in every row. Ideally I would want a result set of the TOP 5 and a returned parameter for the total row count.

WITH Everything AS
(
   SELECT id, a, b, c FROM table
),
DetermineCount AS
(
   SELECT COUNT(*) AS Total FROM Everything 
)
SELECT TOP (5) id, a, b, c, Total
FROM Everything 
CROSS JOIN DetermineCount;

Can you think of a better way?

Is there a way in T-SQL to return the affected row count of a SELECT TOP query before the TOP was applied? @@ROWCOUNT would return 5, but I wonder if there is a @@rowcountbeforetop sort of thing.

Thanks in advance for your help.

**Update**

This is what I'm doing now, and I kind of like it better than the CTE, although CTEs are so elegant.

-- @count is passed in as an out param to the stored procedure

CREATE TABLE dbo.#everything (id int, a int, b int, c int);

INSERT INTO #everything 
SELECT id, a, b, c FROM table WHERE a < 2;      

SET @count = @@rowcount;
SELECT TOP (5) id FROM #everything;
DROP TABLE #everything; 
sOltan
  • What's wrong with executing two queries? One with the SELECT COUNT(*) and one with the TOP? – DeCaf Apr 23 '13 at 17:55 (a sketch of this appears just after these comments)
  • You shouldn't use a TOP without an ORDER clause. How do you define TOP? – Nick Vaccaro Apr 23 '13 at 17:57
  • @DeCaf good point, but it would look quite redundant, don't you think? Especially if the select eventually grows to multiple joins or conditionals. But yes, that would give me what I want at the expense of redundant code. – sOltan Apr 23 '13 at 18:30
  • @NickVaccaro good observation, in this case I just want a random 5, ordering doesn't mean much for the app. – sOltan Apr 23 '13 at 18:32
  • 1
    Well `TOP` without `ORDER BY` won't give you "random"... – Aaron Bertrand Apr 23 '13 at 18:33
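
For reference, here is a minimal sketch of the two-query approach DeCaf suggests above. It is only a sketch: dbo.mytable stands in for the real table, the WHERE clause and the @count output parameter come from the question, and the same filter has to be repeated in both statements, which is exactly the redundancy objected to below.

SELECT @count = COUNT(*)
FROM dbo.mytable
WHERE a < 2;                 -- total before TOP

SELECT TOP (5) id, a, b, c
FROM dbo.mytable
WHERE a < 2
ORDER BY id;                 -- some deterministic order for the 5 rows the app uses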

3 Answers


Here's a relatively efficient way to get 5 random rows and include the total count. The random element will introduce a full sort no matter where you put it.

SELECT TOP (5) id,a,b,c,total = COUNT(*) OVER()
  FROM dbo.mytable
  ORDER BY NEWID();
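
If the five rows don't actually need to be random (as the comment discussion suggests, the order really just needs to be undefined rather than truly random), the same COUNT(*) OVER() pattern should work with any deterministic sort and without the NEWID() cost. A sketch, with the question's WHERE clause assumed:

SELECT TOP (5) id, a, b, c, total = COUNT(*) OVER()
  FROM dbo.mytable
  WHERE a < 2        -- filter from the question
  ORDER BY id;       -- deterministic order instead of NEWID()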
Aaron Bertrand
  • This is nice! But it still includes the count in every row as a column, and somehow, when I tried it, it generated a lot of logical reads on an object called 'worktable'. See my update to the original post; I am leaning towards separating the count from the top 5. Thanks – sOltan Apr 23 '13 at 18:47
  • 1
    @sOltan And did these extra reads actually lead to an *observable* performance problem? You seem very confused about your goal. I think that you need to decide if you care whether the count is included in every row of the result or not. There is no easy way to include it only once and also avoid worktable stuff (which I suspect is due to the random order, NOT the count - maybe you should separate those two requirements as well). – Aaron Bertrand Apr 23 '13 at 18:52
  • There was no observable performance problem, and your query is awesome; I've made a note of it. Right now I'm returning the row count as an out parameter; I've posted an update below. Thank you so much for your help, and I would appreciate any feedback you have on my update. – sOltan Apr 23 '13 at 18:55
  • I did some testing with SQL Server 2008 R2 and registered a 2x speedup (compared to running the "top" and "count" queries separately), but it was very unreliable and more often produced a 10x slowdown instead. It is a pity MSSQL does not support cursors. – Muposat Apr 02 '15 at 23:21
  • Still useful three years later! Used the above to speed up a Redshift Postgres query - thank you! – Alex Hinton Mar 16 '16 at 13:25

Assuming you want the top 5 ordered by id ascending, this will do it with a single pass through your table.

; WITH Everything AS 
(
    SELECT id
        , a
        , b
        , c
        , ROW_NUMBER() OVER (ORDER BY id ASC) AS rn_asc
        , ROW_NUMBER() OVER (ORDER BY id DESC) AS rn_desc
    FROM <table>
)
SELECT id
    , a
    , b
    , c
    , rn_asc + rn_desc - 1 AS total_rows
FROM Everything
WHERE rn_asc <= 5
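
The trick here is that, assuming id is unique, rn_asc + rn_desc = total + 1 for every row, so rn_asc + rn_desc - 1 is the full row count. If you would rather return that count through the @count output parameter from the question instead of as a column, one possible sketch (the table variable is purely illustrative) is:

DECLARE @top5 TABLE (id int, a int, b int, c int, total_rows int);

;WITH Everything AS
(
    SELECT id
        , a
        , b
        , c
        , ROW_NUMBER() OVER (ORDER BY id ASC) AS rn_asc
        , ROW_NUMBER() OVER (ORDER BY id DESC) AS rn_desc
    FROM <table>
)
INSERT INTO @top5 (id, a, b, c, total_rows)
SELECT id, a, b, c, rn_asc + rn_desc - 1
FROM Everything
WHERE rn_asc <= 5;

SET @count = ISNULL((SELECT MAX(total_rows) FROM @top5), 0);  -- read the total once

SELECT id, a, b, c FROM @top5 ORDER BY id;                    -- the 5 rows the app uses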
Nick Vaccaro

**Update**

This is what I'm doing now, and I kind of like it better than the CTE, although CTEs are so elegant. Let me know what you think. Thanks!

-- @count is passed in as an out param to the stored procedure

CREATE TABLE dbo.#everything (id int, a int, b int, c int);

INSERT INTO #everything 
SELECT id, a, b, c FROM table WHERE a < 2;      

SET @count = @@rowcount;
SELECT TOP (5) id FROM #everything;
DROP TABLE #everything; 
sOltan
  • Why do you need the #temp table? Do you think inserting into a new object and obtaining @@ROWCOUNT is cheaper than just SELECT COUNT(*) from the base table? Also, again, TOP without ORDER BY does not give you random, if that's what you're after. There is a big difference between RANDOM and UNDEFINED. – Aaron Bertrand Apr 23 '13 at 18:57
  • Personally, I don't like output params. I think they're too easy to miss, and that returning a second table with a single value is cleaner. However, that's my opinion, coming from someone who considers what the application layer looks like. I'd prefer to see the C# (or whatever) handling 2 data tables in a set. Best of luck either way! – Nick Vaccaro Apr 23 '13 at 18:57 (a sketch of this appears after these comments)
  • @Aaron I actually meant undefined, not random, sorry for the confusion. As for the #temp table (or a local table variable), it's a good place to insert the rows into (they would live in memory anyway, which is faster than physical IO). The example I show is a simple select from a single table; in reality it's a two-join query, and I just didn't want to complicate the code in this post. – sOltan Apr 23 '13 at 20:58
  • @NickVaccaro When I did two queries, it looked awfully redundant. I could refactor the query into a table function, I guess. – sOltan Apr 23 '13 at 20:58
  • @@rowcount - doesn't it return data that was already captured by SQL when executing the select? I always thought that was the case, resulting in no additional overhead. Anyway, I really appreciate all of your feedback. I haven't decided yet, so I'm still considering all of the suggestions you provided. – sOltan Apr 23 '13 at 20:58
  • "no additional overhead" you mean aside from scanning the entire table and inserting it into a #temp table? Why do you think this doesn't involve any overhead? – Aaron Bertrand Apr 23 '13 at 21:02
  • It sounds like you've got enough suggestions to get this done. None of these will be significantly better or worse than any other, so it's all up to you. I would suggest choosing your answer, then closing this question. If you have more questions, present them separately. – Nick Vaccaro Apr 23 '13 at 21:02
  • @Aaron not scanning the entire table; I'm utilizing an index for the data I'm selecting, inserting it into a temp table, and from then on I'm operating on the temp table (in memory) to get the row count as well as the top 5. What am I doing that is wrong or awfully inefficient? – sOltan Apr 23 '13 at 21:13
  • 2
    Ok, so scanning the entire range of the index that matches the WHERE clause (which wasn't in your original question, by the way!). Why do you think a #temp table is "in memory"? I think you have a pretty skewed understanding of what "not any more efficient" and "less efficient" mean. – Aaron Bertrand Apr 23 '13 at 21:13
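
Per Nick Vaccaro's comment above about preferring a second result set over an output parameter, here is a minimal sketch of that variant. It assumes the same #everything temp table populated in the answer above; only the tail of the procedure changes (no @count output parameter, no SET @count = @@rowcount):

SELECT TOP (5) id FROM #everything;          -- first result set: the 5 rows the app uses
SELECT COUNT(*) AS Total FROM #everything;   -- second result set: the total, returned once
DROP TABLE #everything;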