0

I want grouped ranking on a very large table, I've found a couple of solutions for this problem e.g. in this post and other places on the web. I am, however, unable to figure out the worst case complexity of these solutions. The specific problem consists of a table where each row has a number of points and a name associated. I want to be able to request rank intervals such as 1-4. Here are some data examples:

name | points
Ab     14
Ac     14
B      16
C      16
Da     15
De     13

With these values the following "ranking" is created:

Query id | Rank | Name
1          1      B
2          1      C
3          3      Da
4          4      Ab
5          4      Ac
6          6      De

And it should be possible to create the following interval on query-id's: 2-5 giving rank: 1,3,4 and 4.

The database holds about 3 million records so if possible I want to avoid a solution with complexity greater than log(n). There are constantly updates and inserts on the database so these actions should preferably be performed in log(n) complexity as well. I am not sure it's possible though and I've tried wrapping my head around it for some time. I've come to the conclusion that a binary search should be possible but I haven't been able to create a query that does this. I am using a MySQL server.

I will elaborate on how the pseudo code for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:

Find median point value, count rows less than this value (the count gives a rough estimate of rank, not considering those with same amount of points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find median of it. We keep doing this until we are pinpointed to the amount of points where fromrank should start. then we do the same within that amount of points with the name index, and find median until we have reached the correct row. We do the exact same thing for tillrank.

The result should be log(n) number of subdivisions. So given the median and count can be made in log(n) time it should be possible to solve the problem in worst case complexity log(n). Correct me if I am wrong.

Community
  • 1
  • 1
Per Stilling
  • 868
  • 2
  • 9
  • 19
  • Glad my post came in handy. Um, did you try the second solution? Using group_concat? – achinda99 Feb 16 '09 at 19:21
  • I think that method has complexity n if I'm not mistaking, besides that I don't see how it can be modified reasonably easily to support retrieving any range of ranks. – Per Stilling Feb 16 '09 at 20:03
  • Unfortunately, the count itself is the most costly operation here, the time will depend on the rows actually counted, not on the rows searched, so it will still be O(N) – Quassnoi Feb 17 '09 at 08:25

1 Answers1

2

You need a stored procedure to be able to call this with parameters:

CREATE TABLE rank (name VARCHAR(20) NOT NULL, points INTEGER NOT NULL);

CREATE INDEX ix_rank_points ON rank(points, name);

CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
  SET @fromrank = fromrank;
  SET @tillrank = tillrank;
  PREPARE STMT FROM
  '
  SELECT  rn, rank, name, points
  FROM  (
    SELECT  CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
            @rn := @rn + 1 AS rn,
            @cp := points,
            r.*
    FROM (
         SELECT @cp := -1, @rn := 0, @rank = 1
         ) var,
         (
         SELECT *
         FROM rank
         FORCE INDEX (ix_rank_points)
         ORDER BY
           points DESC, name DESC
         LIMIT ?
         ) r
    ) o
  WHERE rn >= ?
  ';
  EXECUTE STMT USING @tillrank, @fromrank;
END;

CALL prc_ranks (2, 5);

If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all, it will depend only on tillrank.

It will actually take last tillrank values from the index, perform some simple calculations on them and filter out first fromrank values.

Time of this operation, as you can see, depends only on tillrank, it does not depend on how many records are there.

I just checked in on 400,000 rows, it selects ranks from 5 to 100 in 0,004 seconds (that is, instantly)

Important: this only works if you sort on names in DESCENDING order. MySQL does not support DESC clause in the indices, that means that the points and name must be sorted in one order for INDEX SORT to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database, and change the sign in the SELECT clause.

You may also remove name from the index at all, and perform a final ORDER'ing without using an index:

CREATE INDEX ix_rank_points ON rank(points);

CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
  SET @fromrank = fromrank;
  SET @tillrank = tillrank;
  PREPARE STMT FROM
  '
  SELECT  rn, rank, name, points
  FROM  (
    SELECT  CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
            @rn := @rn + 1 AS rn,
            @cp := points,
            r.*
    FROM (
         SELECT @cp := -1, @rn := 0, @rank = 1
         ) var,
         (
         SELECT *
         FROM rank
         FORCE INDEX (ix_rank_points)
         ORDER BY
           points DESC
         LIMIT ?
         ) r
    ) o
  WHERE rn >= ?
  ORDER BY rank, name
  ';
  EXECUTE STMT USING @tillrank, @fromrank;
END;

That will impact performance on big ranges, but you will hardly notice it on small ranges.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • Looks very nice, what is the complexity of this query, is it within log(n) range and if so can you explain how come. One thing missing though is sorting on the name as 2nd priority if two rows have the same amount of points. – Per Stilling Feb 16 '09 at 20:48
  • The algorithm my rank is counted on is that the ones with most points have rank 1 and if some people have the same amount of points I want them sorted according to their name. So if two people have max points,the row with third most points is recorded as rank 3, while the two others both have rank 1 – Per Stilling Feb 16 '09 at 20:50
  • I did some testing, and as far as I can see the time does increase as soon as you choose small intervals in the high numbers. – Per Stilling Feb 16 '09 at 22:11
  • I think it is due to the limit clause, are you sure it can be optimized. I thought limit was a convenience clause and wasnt optimized in any way. I did a 50 record request at offset 90000-90050 and took a second, thats a bit slow computer though, but it took 200ms getting rank 1-50. – Per Stilling Feb 16 '09 at 22:11
  • That's exactly what I said, read my post carefully. It depends NOT on the interval itself, but on the tillrank (higher or your bounds). It cannot be further optimized, because when you need to take records from 999,900 to 1,000,000, you need to look on whole million. – Quassnoi Feb 16 '09 at 22:28
  • And did you create the index and forced it into the query? – Quassnoi Feb 16 '09 at 22:29
  • Ah, and ordering by name also matters. The index will work only if points and names are sorted in one direction, as MySQL does not support DESC clause in indexes. You'll need to keep NEGATIVE points in the database, and change sign in the query if you want to sort on names in ascending order. – Quassnoi Feb 16 '09 at 22:32
  • Hmm ok, I understood this as if it didnt matter how records there were, but how many you want to fetch: "Time of this operation, as you can see, depends only on how many records you need, it does not depend on how many records are there." – Per Stilling Feb 16 '09 at 22:34
  • I created your algorithm on my data, and created the index you also created and forced it exactly like you do. I do, however, think it should be possible to do the operation in log(n) time. I will elaborate in my question how I think it can be implemented. – Per Stilling Feb 16 '09 at 22:39
  • Did you change the names ordering from DESC to ASC? And what MySQL version are you using? – Quassnoi Feb 16 '09 at 22:42
  • If you use index, this operation will take LOG(n) time. N here means higher bound of your range, not the overall number of records. – Quassnoi Feb 16 '09 at 22:43
  • Ok, I will always be using a constant range, my problem is scaling according to the total number of records. I've posted an explanation of the solution I assume could work in log(n) time. I have not changed the name ordering and I believe you that within query range the query you made work in log n – Per Stilling Feb 16 '09 at 22:51
  • See updated post. It's very strange that you get 200 ms on (1, 50), with an index this should work instantly. I'm checking on MySQL 5.1.28, maybe FORCE support is broken in 5.0.51. – Quassnoi Feb 16 '09 at 23:01
  • Try to issue this: SELECT * FROM rank FORCE INDEX (ix_rank_points) LIMIT 10 and see how long does it take. It should return 10 your LEAST points and should take less than 10 milliseconds. If it doesn't, then it's something with FORCE INDEX . – Quassnoi Feb 16 '09 at 23:03
  • Here is my database, maybe the fact that it is innodb makes it slower. http://pastebin.com/m655e99ea – Per Stilling Feb 16 '09 at 23:09
  • Sorry if it's a stupid question, but did you replace "ix_rank_points" with "points2" in the query? – Quassnoi Feb 16 '09 at 23:11
  • That query takes 0.0003 seconds – Per Stilling Feb 16 '09 at 23:11
  • However, I figured it out why it did 200ms, it's the EMS SQL query tool I have, it adds traffic delay. So 200ms is probably instant – Per Stilling Feb 16 '09 at 23:13
  • And how long does this function takes on (0, 50) and on (200000, 200050)? – Quassnoi Feb 16 '09 at 23:14
  • Can you look at the pseudocode i wrote, do you know whether count and median can be made with logn or in constant time using indexes ? – Per Stilling Feb 16 '09 at 23:15
  • Post a new question, I'll look it tomorrow – Quassnoi Feb 16 '09 at 23:15
  • 100k takes about 800ms sec, dont have more than 100k in the database at the time being. The 0-50 is instant. – Per Stilling Feb 16 '09 at 23:19