30

I'm working on a server for an online game which should be able to handle millions of players. The game needs leaderboards and should be able to show a player's current position, other players near the current player's position, and the positions of the player's friends.

Now I've done this kind of thing before in MySQL and I know how it's technically possible. However, since this is common practice for a lot of online games, I figured there must be existing libraries or databases built specifically for this purpose.

Can anyone advise me on which database is best suited for these types of queries, and on any pre-existing libraries that already do a lot of this work? A third-party service with API access would be fine too.

Hope to get some good advice, thanks!

Edit:

To clarify, I need a database that can hold millions of entries (so far MySQL is good for that) and from which I can easily get ranked results. For example, if I fetch a specific row from the "leaderboard" table, I need to know which rank that row has. This query has to run in under 500ms regardless of the size of the DB.

Alternatively, a way to update the table with the current ranking information would be fine too, as long as the update query does not lock the whole table and runs in under 30 seconds.

Any ideas as to what database / mechanism or third party service to use would be much appreciated!

Naatan
  • 3,424
  • 4
  • 32
  • 51
  • can you post your schema and how does your current scoring system work ?? TIA – Jon Black Mar 30 '11 at 10:14
  • also can you tell me how many concurrent players you expect to have max usage ? – Jon Black Mar 30 '11 at 11:32
  • Hi f00, the schema is irrelevant and is in service of the end purpose - which is to get a specific ranking among millions of entries in under 500ms. As for concurrent users, this will kind of scale with the number of entries; 1 million entries would equal (very rough estimate) about 10,000 concurrent users. – Naatan Mar 31 '11 at 17:07
  • So you dont mind if i have (game_id, round_id, player_id) as a composite primary key. I'm guessing you might have more than one round in the life cycle of the game - but hey, as it's irrelevant (lol) i guess it doesnt matter. – Jon Black Mar 31 '11 at 23:26
  • hey f00. Not sure what you mean by round, I guess you are talking about some sort of level mechanic? Either way that doesnt really matter, the only 3 fields that matter are id_game, id_user and score. – Naatan Apr 01 '11 at 03:28

6 Answers

35

A single disk seek is about 15ms, maybe a little less with server grade disks. A response time of less than 500ms limits you to about 30 random disk accesses. That is not a lot.

On my tiny laptop, I have a development database with

root@localhost [kris]> select @@innodb_buffer_pool_size/1024/1024 as pool_mb;
+--------------+
| pool_mb      |
+--------------+
| 128.00000000 |
+--------------+
1 row in set (0.00 sec)

and a slow laptop disk. I created a score table with

root@localhost [kris]> show create table score\G
*************************** 1. row ***************************
       Table: score
Create Table: CREATE TABLE `score` (
  `player_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `score` int(11) NOT NULL,
  PRIMARY KEY (`player_id`),
  KEY `score` (`score`)
) ENGINE=InnoDB AUTO_INCREMENT=2490316 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

with random integer scores and sequential player_id values. We have

root@localhost [kris]> select count(*)/1000/1000 as mrows from score\G
*************************** 1. row ***************************
mrows: 2.09715200
1 row in set (0.39 sec)

The database maintains the pair (score, player_id) in score order in the index score: data in an InnoDB secondary index is stored in a B-tree, and the row pointer (data pointer) is the primary key value, so the definition KEY (score) ends up being KEY (score, player_id) internally. We can prove that by looking at the query plan for a score retrieval:

root@localhost [kris]> explain select * from score where score = 17\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: score
         type: ref
possible_keys: score
          key: score
      key_len: 4
          ref: const
         rows: 29
        Extra: Using index
1 row in set (0.00 sec)

As you can see, the key: score is being used with Using index, meaning that no data access is necessary.

The ranking query for a given constant player_id takes precisely 500ms on my laptop:

root@localhost [kris]>  select p.*, count(*) as rank 
    from score as p join score as s on p.score < s.score 
   where p.player_id = 479269\G
*************************** 1. row ***************************
player_id: 479269
    score: 99901
     rank: 2074
1 row in set (0.50 sec)

With more memory and on a faster box it can be quicker, but it is still a comparatively expensive operation, because the plan sucks:

root@localhost [kris]> explain select p.*, count(*) as rank from score as p join score as s on p.score < s.score where p.player_id = 479269;
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows    | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
|  1 | SIMPLE      | p     | const | PRIMARY,score | PRIMARY | 4       | const |       1 |                          |
|  1 | SIMPLE      | s     | index | score         | score   | 4       | NULL  | 2097979 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
2 rows in set (0.00 sec)

As you can see, the second table in the plan is an index scan, so the query slows down linearly with the number of players.

If you want a full leaderboard, you need to leave off the where clause, and then you get two scans and quadratic execution times. So this plan implodes completely.

Time to go procedural here:

root@localhost [kris]> set @count = 0; 
    select *, @count := @count + 1 as rank from score where score >= 99901 order by score desc ;
...
|   2353218 | 99901 | 2075 |
|   2279992 | 99901 | 2076 |
|   2264334 | 99901 | 2077 |
|   2239927 | 99901 | 2078 |
|   2158161 | 99901 | 2079 |
|   2076159 | 99901 | 2080 |
|   2027538 | 99901 | 2081 |
|   1908971 | 99901 | 2082 |
|   1887127 | 99901 | 2083 |
|   1848119 | 99901 | 2084 |
|   1692727 | 99901 | 2085 |
|   1658223 | 99901 | 2086 |
|   1581427 | 99901 | 2087 |
|   1469315 | 99901 | 2088 |
|   1466122 | 99901 | 2089 |
|   1387171 | 99901 | 2090 |
|   1286378 | 99901 | 2091 |
|    666050 | 99901 | 2092 |
|    633419 | 99901 | 2093 |
|    479269 | 99901 | 2094 |
|    329168 | 99901 | 2095 |
|    299189 | 99901 | 2096 |
|    290436 | 99901 | 2097 |
...

Because this is a procedural plan, it is unstable:

  • You cannot use LIMIT, because that will offset the counter. Instead you have to download all this data.
  • You cannot really sort. This ORDER BY clause works, because it does not sort, but uses an index. As soon as you see using filesort, the counter values will be wildly off.

It is the solution that comes closest to what a NoSQL (read: procedural) database will do as an execution plan, though.

We can stabilize this NoSQL-style approach inside a subquery and then slice out the part that is of interest to us, though:

root@localhost [kris]> set @count = 0; 
    select * from ( 
        select *, @count := @count + 1 as rank 
          from score 
         where score >= 99901 
      order by score desc 
    ) as t 
    where player_id = 479269;
Query OK, 0 rows affected (0.00 sec)
+-----------+-------+------+
| player_id | score | rank |
+-----------+-------+------+
|    479269 | 99901 | 2094 |
+-----------+-------+------+
1 row in set (0.00 sec)

root@localhost [kris]> set @count = 0; 
    select * from ( 
        select *, @count := @count + 1 as rank 
          from score 
         where score >= 99901 
      order by score desc 
    ) as t 
    where rank between 2090 and 2100;
Query OK, 0 rows affected (0.00 sec)
+-----------+-------+------+
| player_id | score | rank |
+-----------+-------+------+
|   1387171 | 99901 | 2090 |
|   1286378 | 99901 | 2091 |
|    666050 | 99901 | 2092 |
|    633419 | 99901 | 2093 |
|    479269 | 99901 | 2094 |
|    329168 | 99901 | 2095 |
|    299189 | 99901 | 2096 |
|    290436 | 99901 | 2097 |
+-----------+-------+------+
8 rows in set (0.01 sec)

The subquery will materialize the former result set as an ad-hoc table named t, which we can then access in the outer query. Because it is an ad-hoc table, in MySQL it will have no index. This limits what is possible efficiently in the outer query.

Note how both queries satisfy your timing constraint, though. Here is the plan:

root@localhost [kris]> set @count = 0; explain select * from ( select *, @count := @count + 1 as rank from score where score >= 99901 order by score desc ) as t where rank between 2090 and 2100\G
Query OK, 0 rows affected (0.00 sec)

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2097
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DERIVED
        table: score
         type: range
possible_keys: score
          key: score
      key_len: 4
          ref: NULL
         rows: 3750
        Extra: Using where; Using index
2 rows in set (0.00 sec)

Both query components (the inner, DERIVED query and the outer BETWEEN constraint) will get slower for badly ranked players, though, and then grossly violate your timing constraints.

root@localhost [kris]> set @count = 0; select * from ( select *, @count := @count + 1 as rank from score where score >= 0 order by score desc ) as t;
...
2097152 rows in set (3.56 sec)

The execution time for the descriptive approach is stable (dependent only on table size):

root@localhost [kris]> select p.*, count(*) as rank 
   from score as p join score as s on p.score < s.score 
   where p.player_id = 1134026;
+-----------+-------+---------+
| player_id | score | rank    |
+-----------+-------+---------+
|   1134026 |     0 | 2097135 |
+-----------+-------+---------+
1 row in set (0.53 sec)

Your call.

Isotopp
  • 3,313
  • 1
  • 16
  • 17
  • Thanks for the awesome response lsotopp! Your numbers seem to match mine pretty well. I've gone through this many times and sought out ways to optimize but it seems there just isn't any way around it. What I'm going to do is basically use your suggested method but fallback on rank "estimation" if the score is below a certain number, obviously not ideal but if your rank is #5000 it doesnt matter much if its off by a little. Anyway, thanks a lot for the elaborate response! Incredibly helpful! – Naatan Mar 31 '11 at 17:01
  • You're welcome. The article is a variation of a theme which I first touched with help from two MySQL AB colleagues, Kai Voigt (now Cloudera) and Jan Kneschke (now Oracle). See http://mysqldump.azundris.com/archives/32-The-Quota-Query-and-Running-Sums-by-Jan-and-Kai.html for the original article. – Isotopp Mar 31 '11 at 18:06
  • It's a bit confusing: you used "score" for the table name and the column name so it's hard to distinguish both in (relatively complex) queries. – aberaud Oct 30 '12 at 22:34
  • When you have 50M rows and you are querying the last player, whose score is just zero, will the subquery be as fast as querying the top 1000 players? – Daniel Dec 24 '14 at 02:33
  • instead of giving the range manually, is it possible to give the range using the current player rank, like player_rank -2 and player_rank + 2 – Kumar KS Dec 05 '17 at 04:38
18

I know this is an old question, but I do enjoy staring at such problems. Given the data volume and the query speed required, some non-traditional tricks can be used that take more coding work but can really boost query performance.

Scoring buckets

To begin with, we should track scores with buckets. We want the bucket list (what a great name!) to be small enough to easily hold in memory, yet large enough that individual buckets aren't (relatively speaking) being updated too frequently. That gives us greater concurrency and avoids locking issues.

You'll have to judge how to split up those buckets based on your load, but I think you want to focus on having as many buckets as you can while they still fit easily into memory and update quickly.

To accommodate this, my score_buckets table will have the following structure:

minscore, maxscore, usercount; PK(minscore, maxscore)

User table

We must track our users, which will probably be done with:

userid, score, timestamp
#(etc., etc. that we don't care about for this part of the problem)

In order to efficiently iterate over this to get a count by score, we need an index on the score. Timestamp is just something I threw in for tie-breaking in my example so that I'd have a definitive ordering. If you don't need it, ditch it -- it is using space and that will affect query time. At the moment: index(score, timestamp).
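
For reference, here is one way those two structures could look in MySQL. This is only a sketch: the table and column names follow the shorthand above, and the exact data types are assumptions.

CREATE TABLE score_buckets (
  minscore  INT NOT NULL,            -- lower bound of the bucket (inclusive)
  maxscore  INT NOT NULL,            -- upper bound of the bucket (inclusive)
  usercount INT UNSIGNED NOT NULL,   -- number of users whose score falls into this bucket
  PRIMARY KEY (minscore, maxscore)
) ENGINE=InnoDB;

CREATE TABLE user (
  userid    INT UNSIGNED NOT NULL,
  score     INT NOT NULL,
  timestamp DATETIME NOT NULL,       -- tie-breaker, as described above
  PRIMARY KEY (userid),
  KEY score_ts (score, timestamp)
) ENGINE=InnoDB;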

Inserting / Updating / Deleting users and their scores

Add triggers to the user table. On insertion:

update score_buckets sb
    set sb.usercount = sb.usercount + 1
    where sb.minscore <= NEW.score
    and sb.maxscore >= NEW.score

On update:

update score_buckets sb
    set sb.usercount = sb.usercount - 1
    where sb.minscore <= OLD.score
    and sb.maxscore >= OLD.score
update score_buckets sb
    set sb.usercount = sb.usercount + 1
    where sb.minscore <= NEW.score
    and sb.maxscore >= NEW.score

On deletion:

update score_buckets sb
    set sb.usercount = sb.usercount - 1
    where sb.minscore <= OLD.score
    and sb.maxscore >= OLD.score
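
In actual MySQL trigger syntax, the insert and update cases could look roughly like this (a sketch only; trigger and table names are assumptions, and the delete case mirrors the first half of the update trigger):

CREATE TRIGGER user_score_insert
AFTER INSERT ON user
FOR EACH ROW
  UPDATE score_buckets sb
     SET sb.usercount = sb.usercount + 1
   WHERE sb.minscore <= NEW.score
     AND sb.maxscore >= NEW.score;

DELIMITER //
CREATE TRIGGER user_score_update
AFTER UPDATE ON user
FOR EACH ROW
BEGIN
  -- move the user out of the old bucket and into the new one
  UPDATE score_buckets sb
     SET sb.usercount = sb.usercount - 1
   WHERE sb.minscore <= OLD.score AND sb.maxscore >= OLD.score;
  UPDATE score_buckets sb
     SET sb.usercount = sb.usercount + 1
   WHERE sb.minscore <= NEW.score AND sb.maxscore >= NEW.score;
END//
DELIMITER ;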

Determining Rank

$usersBefore = select sum(usercount)
    from score_buckets
    where maxscore < $userscore;
$countFrom = select max(maxscore)
    from score_buckets
    where maxscore < $userscore;
$rank = select count(*) from user
    where score > $countFrom
    and score <= $userscore
    and timestamp <= $userTimestamp
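
A sketch of how those steps might translate into plain MySQL, with session variables standing in for the $... application variables and a hypothetical player id; the final rank is taken as $usersBefore plus the last count:

-- look up the player we are ranking (id 479269 is just an example)
SET @userscore     = (SELECT score     FROM user WHERE userid = 479269);
SET @usertimestamp = (SELECT timestamp FROM user WHERE userid = 479269);

-- users fully accounted for by buckets entirely below the player's score
SET @usersbefore = (SELECT COALESCE(SUM(usercount), 0)
                      FROM score_buckets
                     WHERE maxscore < @userscore);

-- upper bound of the last whole bucket; only scores above it need to be scanned
SET @countfrom = (SELECT COALESCE(MAX(maxscore), -1)
                    FROM score_buckets
                   WHERE maxscore < @userscore);

SELECT @usersbefore + COUNT(*) AS `rank`
  FROM user
 WHERE score > @countfrom
   AND score <= @userscore
   AND timestamp <= @usertimestamp;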

Closing notes

Benchmark with various numbers of buckets, doubling or halving them each time. You can quickly write up a bucket doubling / halving script to load test this. More buckets mean less scanning of the user score index and less lock / transaction contention when updating scores, but they also consume more memory. To pick a number to start with, use 10,000 buckets. Ideally, your buckets will cover the entire range of scores and each bucket will have roughly the same number of users counted in it. If your score distribution graph follows a curve of some kind, make your bucket distribution follow that curve.

The theory behind this is kind of related to a two-tiered skip list.

Jeff Ferland
  • 17,832
  • 7
  • 46
  • 76
  • 1
    I love this approach. – Sorter Apr 10 '14 at 11:31
  • My two cents, if too many users have the same score, even they exceeds the threshold but you still cannot split the bucket. This happens when a game is not very popular and many players quit within 10 mins, but leave a initial score like 100. So, it's better to build the buckets with both score and ts. score_buckets would have three columns, score_ts_from, score_ts_to, and count. – Daniel Dec 24 '14 at 02:48
5

I've read an article recently on solving this kind of problem with Redis. You could still use MySQL as your basic store, but you would cache the unsorted results in Redis and update the rankings in real time. The link can be found here. The last third of the article is about keyed sorts, like you'd have with a rankings list.

David Richards
  • 955
  • 6
  • 10
  • Thanks David, unfortunately that doesn't really work since my leaderboards can hold millions of entries. Bit much to cache that. I can get the ranking with sql queries but those queries will run longer the bigger the table gets. I need something that will scale with larger DB's without having the query execution time increase too much. – Naatan Mar 28 '11 at 17:12
  • @Naatan: I Support David answer, NoSQL is good way to handle it smoothly. It is "something that will scale with larger DB's without having the query execution time increase too much." If you have database, which scales for millions users, having parallel Redis server for some user params should be no problem. – w.k Mar 31 '11 at 15:43
  • @Naatan: you should probably check out that article. The discussion is about using Redis for a problem set of millions, which is what you're describing. Maybe one way to think of it is MySQL is optimized for set operations and Redis is optimized for fast caching of very large data sets. – David Richards Mar 31 '11 at 21:50
  • Thanks David, I will give it a go once I have my initial usable version running which should work up till about 5 million entries, which we won't be reaching for the next few months. – Naatan Apr 01 '11 at 03:30
3

Sorting millions of entries might sound like a lot of work, but it clearly isn't. Sorting 10^6 completely random entries takes about 3 seconds on my computer (just an older EeePC with an Atom CPU (first generation, I think), 1.6GHz).

And with a good sorting algorithm, sorting is O(n*log(n)) in the worst case, so it won't really matter if you have 10^9 or more entries. Moreover, most of the time the rank list will already be nearly sorted (from a previous ranking), resulting in a runtime that is more likely to be O(n).

So, stop worrying about it! The only real problem is that most DBMSs cannot directly access the 1000th entry. A query like SELECT ... LIMIT 1000, 5 will have to read at least 1005 entries and skip the first 1000. But the solution here is simple too. Just store the rank as a redundant column in each row, add an index on it, and recompute it every 15 min (or every 5 min, 30 min, 1 h, or whatever makes sense for your application). With that, all queries by rank are simple secondary index lookups (about O(log(N))), which is extremely fast and will only take a few milliseconds per query (the network is the bottleneck here, not the database).
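
A minimal sketch of that periodic recompute, assuming a score table like the one in the accepted answer plus a redundant, indexed rank column (all names are assumptions):

-- one-time schema change
ALTER TABLE score
  ADD COLUMN `rank` INT UNSIGNED NOT NULL DEFAULT 0,
  ADD KEY (`rank`);

-- renumber all rows by descending score; run this from a cron job every N minutes
SET @r := 0;
UPDATE score SET `rank` = (@r := @r + 1) ORDER BY score DESC;

-- afterwards, rank queries are plain index range scans
SELECT * FROM score WHERE `rank` BETWEEN 2090 AND 2100;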

PS: You commented on another answer that you cannot cache the sorted entries because they would be too large for your memory. Assuming you just cache (user_id, rank) tuples as two 64-bit integers (32 bits would be more than enough too!), you would need about 16 MB of memory (8 MB with 32-bit integers) to store 10^6 entries. Are you sure you do not have enough RAM for that?

So, please do not try to optimize something which is clearly not a bottleneck (yet)...

tux21b
  • 90,183
  • 16
  • 117
  • 101
  • Thanks for the response tux21b! Very informative! Unfortunately to do as you say and to update every 15, 30 or 60 minutes would still cause the query to run for about 1 minute to update all the rows with their new ranks, which will still be ever increasing as the database grows. I might have to resort to it but I'd strongly prefer not to run such heavy queries in a cron job. I'll look into caching the results though, that might be a good solution. – Naatan Mar 29 '11 at 14:00
  • How about storing the rankings in their own table (user_id, rank, index on rank)? Every N minutes, empty and reload. – Philip Kelley Mar 30 '11 at 16:18
1

You can redundantly store the rank of each player in the player table so that you don't have to do any join operations. Every time the leaderboards are recalculated, the player table should be updated, too.

Alp
  • 29,274
  • 27
  • 120
  • 198
  • I know I can but this update takes forever to execute with millions of entries, lest you know a database / update mechanism that significantly reduces the execution time? – Naatan Mar 28 '11 at 18:42
  • You could try out [TRIGGERs](http://dev.mysql.com/doc/refman/5.1/de/create-trigger.html) – Alp Mar 28 '11 at 18:45
  • 2
    How would a trigger reduce the execution time of the query that would get triggered? Also I cant simply update the rank of the row that was added / updated as this rank would offset all of the ranks of the rows coming after it. – Naatan Mar 28 '11 at 18:58
0

I can think of two ways to approach this problem:

First approach: Update in batches:

  • Sort the scores, obtain the ranking
  • Divide the players by rank into batches like player0-player999, player1000-player1999, etc
  • For each batch, delete entries in the existing table that conflict with the new data, meaning existing entries that belong to players in the current batch or whose current rank falls within the range of ranks being updated by the batch. Then load the ranking data for the batch into the database and move on to the next batch after, say, 0.1s (see the sketch below).
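
A sketch of what a single batch could look like, assuming a live leaderboard table (player_id, score, rank) and a staging table new_ranks that holds the freshly computed ranking (all names are assumptions):

-- batch covering ranks 1000-1999
DELETE FROM leaderboard
 WHERE `rank` BETWEEN 1000 AND 1999
    OR player_id IN (SELECT player_id FROM new_ranks
                      WHERE `rank` BETWEEN 1000 AND 1999);

INSERT INTO leaderboard (player_id, score, `rank`)
SELECT player_id, score, `rank`
  FROM new_ranks
 WHERE `rank` BETWEEN 1000 AND 1999;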

Second approach: New table

  • Create (or truncate) a new table just like your existing ranking table.
  • Compute the new ranking and insert your data.
  • Swap the tables (preferably after locking them). This should take less than a second (see the sketch below).
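
A sketch of the table swap, again assuming the live ranking table is called leaderboard and holds (player_id, score, rank):

CREATE TABLE leaderboard_new LIKE leaderboard;

SET @r := 0;
INSERT INTO leaderboard_new (player_id, score, `rank`)
SELECT player_id, score, (@r := @r + 1)
  FROM score
 ORDER BY score DESC;

-- RENAME TABLE swaps both names in one atomic operation, so readers never see an empty table
RENAME TABLE leaderboard TO leaderboard_old,
             leaderboard_new TO leaderboard;
DROP TABLE leaderboard_old;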
Seun Osewa
  • 4,965
  • 3
  • 29
  • 32