SQL: Select top N entries for a unique id and deduplicate

Question

I have a table that looks like this:

HorseID     RiderID     Total
121         1           2
5           1           150
600         1           20
30          2           500
5           3           10
600         3           10
34          3           10

Each horse can potentially have multiple riders, and visa versa. 'Total' refers to the total number of incidents with that combination of rider and horse. What I would like is a table with the top 2 HorseIDs in terms of total for each rider, deduplicated. (ie. the top horse for rider 1 was horse 5, as they have had 150 incidents together, compared with 20 and 2 for the other horses with rider 1). So what I would like as output here is:

HorseID
5
600
30

Note when there is less than 2 horses for a rider, it should just grab the top 1 instead. When it comes to rider 3, there is a draw so it picks the top 2 in terms of order - however, these are already on the list so are not added (the deduplicate part - also happy to deduplicate afterward in a separate step if that's faster). I wouldn't mind if this instead grabbed number 34 in a draw (so we had an extra horseid on the list), any method of resolving draws is fine.

This is a toy dataset, the real dataset is much bigger and I will be grabbing the top 200 for each RiderID, so a scale-able/efficient solution would be great.

How would I go about this?

you need to explain the top half better. In your head didn't make it to the screen. Then you need to show your code you have tried — Drew, Sep 15 '15 at 01:44

score 0 · Answer 1 · edited May 23 '17 at 12:29

This is just a slightly modified greatest-n-per-group question. MySQL isn't the best at greatest-n-per-group since it lacks analytic functions, but you can do something like this:

SELECT DISTINCT HorseID
FROM (SELECT HorseID, RiderID, Total,
        @num := IF(@group = RiderID, @num + 1, 1) AS row_number,
        @group := RiderID AS group
    FROM UnnamedTable
    ORDER BY RiderID, Total DESC, HorseID) ranked
WHERE row_number <= 2;

Beware that this does not handle ties. If the first three horses have a Total of 200, you still only get two rows. Notice that the third item in the ORDER BY of the subquery is HorseID. That's simply there to make the results deterministic in the case of a tie. You can handle ties by adding a little logic to the IF() expression and change it from like ROW_NUMBER() to like DENSE_RANK(). If you want it to be like RANK(), however, I think you need a third variable in there.

I'm not sure how well this scales, either.

You went all oracle on us at the end of that. Plus, you forgot to initialise your variables — Strawberry, Sep 15 '15 at 07:06
@Strawberry Oracle, Postgres, SQL Server, DB2, Sybase, Firebird, Informix. Many RDBMSs support analytic functions. MySQL is somewhat unusual for *not* doing so. SQLite doesn't, either, but that's not exactly surprising. — Bacon Bits, Sep 15 '15 at 13:56

SQL: Select top N entries for a unique id and deduplicate

1 Answers1