
We've got a wide table that we're currently trying to optimize. The table has 50 columns (stats) that we eventually want to rank in descending order. Currently there's well over 5 million rows.

We're looking for ways to optimize this table, both in terms of reducing complexity and improving read speed. Write speed also matters to us, but read is more critical. The ranks of these statistics should be as close to real time as possible; ideally they would be computed quickly on a per-request basis, since new rows are added all the time and we want to show ranks for them as soon as possible.

We're currently evaluating whether or not a vertical table layout would be a.) more performant, and b.) easier to work with.

Because the stats that are being inserted are not necessarily well defined, it's easier for us if they aren't hard coded into the table (hence the preference for a vertical table structure.)

Here's a look at our current table structure and query:

CREATE TABLE Stats 
(
    Id BIGINT PRIMARY KEY NOT NULL,
    UserId INT,
    Name VARCHAR(32) NOT NULL,
    Value DECIMAL(10,4) DEFAULT ((0)) NOT NULL,
    UpdatedAt DATETIME
);

CREATE INDEX Leaderboard__index ON Stats (Name, Value DESC);

SELECT
    Id,
    Name,
    Value,
    RANK() OVER (PARTITION BY Name ORDER BY Value DESC) AS Rank
FROM 
    Stats
ORDER BY 
    Value DESC

Typically we'd either be searching for top N rows for any given stat (like a leaderboard), or we'd be selecting a single UserId and getting the rank of all stats associated with that UserId.
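
To illustrate, the two access patterns look roughly like this (a sketch; `@Name` and `@UserId` are placeholder parameters, and the `TOP` value is arbitrary):

```sql
-- 1. Leaderboard: top N rows for a given stat.
SELECT TOP (100)
    UserId,
    Value,
    RANK() OVER (ORDER BY Value DESC) AS Rank
FROM Stats
WHERE Name = @Name
ORDER BY Value DESC;

-- 2. Per-user: the rank of every stat for a single user.
WITH Ranked AS
(
    SELECT
        UserId,
        Name,
        Value,
        RANK() OVER (PARTITION BY Name ORDER BY Value DESC) AS Rank
    FROM Stats
)
SELECT Name, Value, Rank
FROM Ranked
WHERE UserId = @UserId;
```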

The data is of considerable size (as I mentioned above, because there's a lot of rows and a lot of columns, a vertical table structure might be in the range of 250 million rows and will continue to grow.)

We're looking to fetch this data as fast as possible on whatever hardware is required, seconds is our target, as we're currently in the minutes range.

In a test of the vertical table structure we inserted over 400,000 rows of data, and the query above takes a little under 3 minutes (though ranking only 10,000 rows was just about 18 seconds faster.)

I'd love to hear any suggestions. Thanks for your time!

iLoch
    Can you update your tags with the version of SQL Server and also mention which edition you're using (standard, enterprise etc)? Ta. – Liesel Jun 12 '16 at 07:14
  • Based on the above I would recommend creating another table that would maintain, say, top 100 rank user list per "name" with scores. This will significantly reduce the data size but will not allow ranking of users outside top 100. – Alex Jun 12 '16 at 07:20
  • @Alex Fetching the top 10,000 rows takes only a couple hundred milliseconds round trip from the server -- the top N table design is problematic for us because - based on our UI design - it's trivial to go from looking at the top 100 to viewing 100 rows at an offset of 100,000. – iLoch Jun 12 '16 at 07:46
  • @Alex To add to that though - we would consider writing the entire table into another sorted table if it were fast enough, at the expense of these ranks not updating as often (we'd still like to do this as often as possible.) – iLoch Jun 12 '16 at 07:52
  • Well, if I understood the problem correctly, you want to rank a new score as a soon as it is inserted (or retrieved for display purposes). This means that you have to analyse all other user scores. So, essentially, you are doing a select of the whole group of records (thousands) everytime you display a user score. My Top 100 approach would only ever show top 100 scores and ignore the rest. My suggested approach will NOT work if you want to show ranking for any score in your list (not just top N). – Alex Jun 12 '16 at 07:56

1 Answer


The index you have is not useful for your window function, because:

1. To get the Id column value, SQL Server may end up doing key lookups, or even scanning a whole other index if it crosses the tipping point, so your index may not be used at all.

2. You are ordering by Value DESC, which requires a sort when there is no suitable index, and may even end up spilling to tempdb.

3. There is one more interesting fragmentation aspect; see below.

Typically, for a window function to perform well, you will need a POC index, which means:

P, O -- the PARTITION BY and ORDER BY columns should be in the index key
C -- covering -- the columns you are selecting should be in the INCLUDE clause

So for the below query to work optimally:

SELECT
    Id,
    Name,
    Value,
    RANK() OVER (PARTITION BY Name ORDER BY Value DESC) AS Rank
FROM 
    Stats
ORDER BY 
    Value DESC

you will need the below index:

CREATE INDEX nci_test ON dbo.Stats (Name, Value DESC)
INCLUDE (Id);

There is one more issue with your index, created with "Value DESC".

Normally an index stores values in ascending order by default, but with this index you are asking for them to be stored in reverse, which can cause logical fragmentation, as explained in Martin Smith's answer here. Pasting the relevant part of that answer:

If the index is created with keys descending but new rows are appended with ascending key values then you can end up with every page out of logical order. This can severely impact the size of the IO reads when scanning the table and it is not in cache
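
As a sketch of how this fragmentation could be monitored and repaired (object names match the table above; `sys.dm_db_index_physical_stats` is a standard SQL Server DMV, and the 30% threshold is just a common rule of thumb, not a hard rule):

```sql
-- Check logical fragmentation of the leaderboard index.
SELECT
    i.name AS IndexName,
    s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Stats'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id = s.index_id
WHERE i.name = 'Leaderboard__index';

-- Rebuild when fragmentation is high (e.g. above ~30%).
ALTER INDEX Leaderboard__index ON dbo.Stats REBUILD;
```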

So, a few options:

1. Run an index rebuild at whatever frequency suits you, to see if it helps.

2. Changing the query to order by the partition column eliminates the need for the index to be created with the "Value DESC" option:

SELECT
    Id,
    Name,
    Value,
    RANK() OVER (PARTITION BY Name ORDER BY Value DESC) AS Rank
FROM 
    Stats
ORDER BY 
    Name DESC

The above query doesn't need an index like the one you created; you can change it as below, which also takes care of the fragmentation aspect noted above:

CREATE INDEX Leaderboard__index ON Stats (Name, Value)
INCLUDE (Id);

References:
Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions

TheGameiswar
  • Wow, thanks for this incredibly thoughtful response. I'm going to give it a try now. – iLoch Jun 12 '16 at 08:03
  • You're a wizard. `400000 rows retrieved starting from 1 in 17s 386ms (execution: 238ms, fetching: 17s 148ms)` – iLoch Jun 12 '16 at 08:13
  • Could the `include(id)` on both suggested indexes be removed from the answer, as it is redundant? See this stackoverflow post: https://stackoverflow.com/questions/5328179/good-doesnt-matter-bad-to-include-primary-key-in-covering-index – Copy and Paste Sep 17 '22 at 17:59