
I have a table with close to 30 million records and just a few columns. One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.

For now I have this (in the example, the year I'm searching for is '1970' - it is a parameter in my stored procedure):

WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

Every query of that sort (only the Born parameter used) returns just over 1 million results. I've noticed the biggest overhead is the count used to return the total number of results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause, the whole thing speeds up a lot.

Is there a way to speed up the count in that query? What I care about is having the paged results returned along with the total count.

mrt
  • Try using `COUNT(*) OVER()` inside the CTE instead of trying to run a subquery on every row of the output. – Aaron Bertrand Nov 29 '12 at 15:27
  • @AaronBertrand I set up the OP's query on my system and it ran the subquery first (making a single-row result) and then nested-loop joined (1 execution) the rest of the query. The result was shown as having "Table 'Person'. Scan count 2". I can definitively say that the subquery is not running on "each row". – Amy B Nov 29 '12 at 15:52
  • @DavidB I didn't say that *would* be the result - I merely commented on what it appears that he is *trying* to do. – Aaron Bertrand Nov 29 '12 at 15:53
  • This case might be worthy of considering an indexed view, e.g. `SELECT Born, COUNT_BIG(*) FROM Person` – Aaron Bertrand Nov 29 '12 at 15:55
  • Thanks for the suggestion. Tried the COUNT OVER as @t-clausen.dk suggested below but no luck. Anything else I could try? – mrt Nov 29 '12 at 16:00

4 Answers


Updated following discussion in comments

The cause of the problem here is very low cardinality of the IX_Person_Born index.

SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.

You can think of it like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.

This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.

Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.

A high-cardinality index (say, on the full date of birth rather than just the year) would be much quicker.

This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.

In short: there's no significantly quicker way to get the count using T-SQL and indexes.

You have a couple of options:

1. Data Aggregation

Either use OLAP/cube rollups or do it yourself:

select Born, count(*) 
from Person 
group by Born

The pro is that a cube lookup or a cache check is very fast. The problem is that the data will get out of date, and you need some way to account for that.
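
As a sketch of the do-it-yourself variant: cache the rollup in a small summary table and refresh it on a schedule (e.g. from a SQL Agent job). The table name here is hypothetical, and it assumes Born is stored as a four-character year string:

-- Hypothetical cache table; refresh on whatever schedule suits your staleness tolerance.
CREATE TABLE dbo.BornCountsCache
(
    Born        char(4)  NOT NULL PRIMARY KEY,
    PeopleCount int      NOT NULL,
    RefreshedAt datetime NOT NULL DEFAULT (GETDATE())
)

-- Rebuild the cache in one shot.
TRUNCATE TABLE dbo.BornCountsCache

INSERT dbo.BornCountsCache (Born, PeopleCount)
SELECT Born, COUNT(*)
FROM dbo.Person
GROUP BY Born

-- The paging procedure then reads the total from here instead of counting a million rows:
SELECT PeopleCount
FROM dbo.BornCountsCache
WHERE Born = '1970'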

2. Parallel Queries

Split into two queries:

SELECT count(*) 
FROM Person 
WHERE Born = '1970'

SELECT TOP 30 *
FROM Person 
WHERE Born = '1970'

Then run these either in parallel server side, or issue them as two separate calls from the user interface.
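
Server side, a minimal sketch of that split as a single stored procedure returning two result sets - the procedure name and the @Born, @RowStart and @RowEnd parameters are made up for illustration:

CREATE PROCEDURE dbo.GetPeopleByBornPage
    @Born     char(4),
    @RowStart int,
    @RowEnd   int
AS
BEGIN
    SET NOCOUNT ON

    -- Result set 1: just the total, so the UI can render paging controls.
    SELECT COUNT(*) AS TotalPeople
    FROM dbo.Person
    WHERE Born = @Born

    -- Result set 2: one page of rows.
    ;WITH PersonSubset AS
    (
        SELECT *, ROW_NUMBER() OVER (ORDER BY Born ASC) AS Row
        FROM dbo.Person
        WHERE Born = @Born
    )
    SELECT *
    FROM PersonSubset
    WHERE Row BETWEEN @RowStart AND @RowEnd
END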

3. No-SQL

This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches, every server is checked at the same time.

At this point a technology change is probably out, but it may be worth investigating so I've included it.

I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and an AJAX call to get the count.

Keith
  • Good suggestion about splitting the queries. Will most likely even do 2 separate SPs and cache the results separately, so the slow count has an effect while loading the first page only. I'm sure the count is the slow part - check out my comment under @David B's answer. – mrt Nov 29 '12 at 15:57
  • From the execution plan: 74% of the query cost in Index Seek on IX_Person_Born – mrt Nov 29 '12 at 16:07
  • @MRT - I would expect a subset of 30 rows to return a lot quicker than the count, but 1.8 seconds is awful slow for the count, even with millions of records. I assume what you need the count for is total number of pages, in which case you don't need to get it at the same time as the paged data - get the count in an async/AJAX call (or similar) and show a loading icon over the paging controls while you do so. The page will 'feel' fast to users, even if they have to wait for a second or so before they can see totals. – Keith Nov 29 '12 at 16:08
  • That's a great suggestion with the UI. Still, I wouldn't mind finding a faster way of counting the results. Theoretically it seems it should be possible if the only query condition has an index on it... – mrt Nov 29 '12 at 16:14
  • @MRT If the majority of the work is on `IX_Person_Born` then you may have the right plan, but it's the index that needs looking at - fragmentation and out-of-date statistics can cause indexes to be slow. Part of your problem may be due to the cardinality vs specificity of your data - with only 30 values for 30 million records, each value covers about a million records, meaning that the index adds a lot less value than one on a unique or near-unique column might. You can think of it like looking up a surname in the phone book and finding that 'Jones' covers 20 pages. – Keith Nov 29 '12 at 16:18
  • That's right, the fact there are only 30 different values in the index is the reason for it being slow (I have separate indexes on other, more unique columns and they perform much better). I wouldn't look at statistics as I'm resetting all the caches before testing. Obviously they will be beneficial in production, but the idea is to get the query as fast as possible on the first execution. – mrt Nov 29 '12 at 16:22
  • @MRT if you're dealing with large amount of low-cardinality data then SQL may not be your best bet - it could be worth looking at specialised solutions (like bitmap indexes), MS Analysis Services (OLAP is all about quick analysis of huge sets) or a no-SQL solution like Big Table (query 30 cheap servers at once and get a count of the people born in 1970 from each). – Keith Nov 29 '12 at 16:24
  • Appreciate all the suggestions. – mrt Nov 29 '12 at 16:28
DECLARE @TotalPeople int

-- Does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')


WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.


> One of the columns, 'Born', has no more than 30 different values and there is an index defined on it.

Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
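
One quick way to check (a sketch using standard session settings) is to compare logical reads and timings for the count with and without the index hint:

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Compare reads/elapsed time for the two variants.
SELECT count(*) FROM Person WHERE Born = '1970'
SELECT count(*) FROM Person WITH (INDEX(IX_Person_Born)) WHERE Born = '1970'

If the hinted version reads far more pages, the hint is likely hurting rather than helping.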

Here is a desperate measure that will force SQL Server's hand (at the potential cost of making writes very expensive - measure that - and of blocking schema changes to the Person table while the view exists).

CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born

GO 

CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)

By putting a clustered index on a view, you make it a system-maintained copy. The size of this copy is much smaller than 30 million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
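
For example, a minimal sketch of querying the view directly - on editions that don't match indexed views automatically, the NOEXPAND hint forces the view's index to be used:

-- Read the pre-maintained count straight from the indexed view.
SELECT NumRows
FROM dbo.BornCounts WITH (NOEXPAND)
WHERE Born = '1970'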

Amy B
  • The original query takes 2017ms. Count only takes 1831ms. The original without the total count: 632ms. So you reckon there's no way of speeding it up? – mrt Nov 29 '12 at 15:53
  • The view should be considered automatically on Developer/Enterprise/Evaluation. You may need to use the view explicitly (and use `NOEXPAND`) on lower editions. P.S. novel idea! – Aaron Bertrand Nov 29 '12 at 16:18
  • Like the 'desperate measure' idea. Definitely going to test it if I don't find any 'query time' improvement. – mrt Nov 29 '12 at 16:25
  • @AaronBertrand your comment about indexed view was posted at the same time I was testing the possibility. This gives me some comfort - there's a common professional approach to these kinds of problems - we're not just making stuff up. – Amy B Nov 29 '12 at 19:03
WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, MAX(Row) OVER () AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

Why not like that?


WKordos

Here is a novel approach using system DMVs, if you can get by with a "good enough" count, you don't mind creating an index for every distinct value of [Born], and you don't mind feeling a little bit dirty inside.

Create a filtered index for each year:

--pick a column to index, it doesn't matter which.    
CREATE INDEX IX_Person_filt_1970 on Person ( id )  WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id )  WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id )  WHERE Born = '1972'
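
If maintaining one CREATE INDEX statement per year gets tedious, here is a hedged sketch that generates them with dynamic SQL - it assumes Born holds clean four-character year strings that are safe to inline:

DECLARE @sql nvarchar(max) = N''

-- Build one CREATE INDEX statement per distinct Born value.
SELECT @sql = @sql
    + N'CREATE INDEX IX_Person_filt_' + Born
    + N' ON Person ( id ) WHERE Born = ''' + Born + N''';' + NCHAR(10)
FROM (SELECT DISTINCT Born FROM Person) AS y

EXEC sys.sp_executesql @sql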

Then use the [rows] column from sys.partitions to get a rowcount.

WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, 
    (
    SELECT sum(rows) 
    FROM sys.partitions p 
        inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id 
        inner join sys.tables t on t.object_id = i.object_id 
    WHERE t.name ='Person' 
        and i.name = 'IX_Person_filt_' + '1970' --or use @p1 here
    )  AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close). Also, this approach won't work if you need to filter on anything but [Born].
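
To gauge whether "good enough" holds in your case, a quick sanity check (a sketch reusing the filtered index above) is to compare the DMV figure against an exact count for one year:

-- Compare the DMV row count with the real count for a single year.
SELECT
    (SELECT sum(p.rows)
     FROM sys.partitions p
         inner join sys.indexes i on p.object_id = i.object_id and p.index_id = i.index_id
     WHERE i.name = 'IX_Person_filt_1970') AS DmvRows,
    (SELECT count(*) FROM Person WHERE Born = '1970') AS ExactRows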

StrayCatDBA