6

I have two tables: "servers" and "stats"

servers has a column called "id" that auto-increments. stats has a column called "server" that corresponds to a row in the servers table, a column called "time" that represents the time it was added, and a column called "votes" that I would like to get the average of.

I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a "greatest-n-per-group" question.

This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:

SELECT servers.*,
       IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
  (SELECT server,
          votes
   FROM stats
   GROUP BY server
   ORDER BY time DESC LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id

Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.

Jacob Brunson
  • 1,482
  • 4
  • 23
  • 37

3 Answers3

2

Thanks for this great post.

alter table add index(server, time)
 set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server, 
       time,votes, 
       @num := if(@server = server, @num + 1, 1) as row_number, 
       @server:= server as dummy 
from stats force index(server) 
group by server, time 
having row_number < 25) as stats 
on servers.id = stats.server
group by servers.id

edit 1

I just noticed that above query gives the oldest 24 records for each groups.

 set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server, 
       time,votes, 
       @num := if(@server = server, @num + 1, 1) as row_number, 
       @server:= server as dummy 
from (select * from stats order by server, time desc)  as t
group by server, time 
having row_number < 25) as stats 
on servers.id = stats.server
group by servers.id

which will give the average of the 24 newest entity for each group

Edit2

@DrAgonmoray you can try the inner query part first and see if it returns the the newest 24 records for each group. In my mysql 5.5, it works correctly.

select server, 
       time,votes, 
       @num := if(@server = server, @num + 1, 1) as row_number, 
       @server:= server as dummy 
from (select * from stats order by server, time desc)  as t
group by server, time 
having row_number < 25
lucemia
  • 6,349
  • 5
  • 42
  • 75
  • I'm getting a syntax error here: 'select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes from servers lef' at line 2 – Jacob Brunson Jun 25 '12 at 20:46
  • 1
    @DrAgonmoray, put a `;` after the `alter table add...` line as well as the `set @num...` line as they are separate commands from actual query. – Zane Bien Jun 25 '12 at 20:53
  • Now the code works, however it appears to be giving me the average of all the records for each server, instead of just the last 24. I tested this using several different servers. – Jacob Brunson Jun 26 '12 at 00:21
1

This is another approach.

This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)

SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT 
  JOIN ( SELECT t.server
              , AVG(t.votes) AS avg_votes
           FROM ( SELECT CASE WHEN u.server = @last_server 
                           THEN @i := @i + 1
                           ELSE @i := 1 
                         END AS i
                       , @last_server := u.server AS `server`
                       , u.votes AS votes
                    FROM (SELECT @i := 0, @last_server := NULL) i
                    JOIN ( SELECT v.server, v.votes
                             FROM stats v
                            ORDER BY v.server DESC, v.time DESC
                         ) u
                ) t
          WHERE t.i <= 24
          GROUP BY t.server
       ) s
    ON s.server = r.id

What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u.)

With the sorted result set, we assign a row numbers 1,2,3, etc. to each row for each server. (Inline view aliased as t.)

With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes column for the "latest" 24 rows for each server. (Inline view aliased as s.)

As a final step, we join that to the servers table, to return the requested resultset.


NOTE:

The execution plan for this query will be COSTLY for a large number of rows in the stats table.

To improve performance, there are several approaches we could take.

The simplest might to be include in the query a predicate the EXCLUDES a significant number of rows from the stats table (e.g. rows with time values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted, to determine the "latest" 24 rows.

Also, with an index on stats(server,time), it's also possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.

We could also consider implementing an index on the stats table on (server,"reverse_time"). Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on an a derived rtime value (a "reverse time" expression that is ascending for descending values of time (for example, -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF('1970-01-01',my_datetime).

Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest rows" won't be deleted from the stats table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats table, we check if the time on the new rows is later than the earliest time stored for the server in the shadow table, if it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.

And, yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, and run a separate query against the stats table to get the average votes for the latest 24 rows, and gather all of those results together. (That approach mighty really be more of a workaround to avoiding a sort on huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)

The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.


ADDENDUM

To get a "reverse index scan" operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even specifying the FORCE INDEX FOR ORDER BY (stats_ix1) hint.

If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)

SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT 
  JOIN ( SELECT t.server
              , t.tot/NULLIF(t.cnt,0) AS avg_votes
           FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
                       , @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
                       , @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0)      ,@tot := 0),@tot := IFNULL(v.votes,0)      ) AS tot
                       , @last_server := v.server AS SERVER
                    -- , v.time
                    -- , v.votes
                    -- , @tot/NULLIF(@cnt,0) AS avg_sofar
                    FROM (SELECT @last_server := NULL, @num:= 0, @cnt := 0, @tot := 0) u
                    JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
                   ORDER BY v.server DESC, v.time DESC
                ) t
          WHERE t.num = 24
       ) s
    ON s.server = r.id

With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, and index on '(server,time), MySQL used the index if I included an index hint, with theFORCE INDEX FOR ORDER BY (stats_ix1)` hint, MySQL avoided a filesort as well. (But since my table had less than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)

The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren't needed, but they are for debugging.

The way that query stands, it needs at least 24 rows in stats for each server, in order to return an average. (That may be acceptable.) But I was thinking that in general, we could return a running total, total so far (tot) and a running count (cnt).

(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)

To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Sorry for the late response. This query works, and works faster than the previous answers. I also very much appreciate your detailed explanation and your many solutions for increasing speed. At the moment there are 40,000 rows, however there is the potential for it to increase to several million. I'll use an index (`stats(server,time)`) for now and if there is a significant performance decrease I will likely implement your shadow table suggestion. Thank you very very much! – Jacob Brunson Jun 30 '12 at 23:40
  • A covering index on `stats(server,time,votes)` would be even better for performance. I've added an addendum to my answer, with another query which may be even quicker. It's got a limitation (as its written) that there need to be at least 24 rows in the stats table for a server for an average to be returned. – spencer7593 Jul 02 '12 at 22:10
0

Try this solution, with the top-n-per-group technique in the INNER JOIN subselect credited to Bill Karwin and his post about it here.

SELECT 
    a.*,
    AVG(b.votes) AS avgvotes
FROM
    servers a
INNER JOIN
    (
        SELECT 
            aa.server, 
            aa.votes
        FROM 
            stats aa
        LEFT JOIN stats bb ON 
            aa.server = bb.server AND
            aa.time < bb.time
        GROUP BY
            aa.time
        HAVING
            COUNT(*) < 24
    ) b ON a.id = b.server
GROUP BY
    a.id
Community
  • 1
  • 1
Zane Bien
  • 22,685
  • 6
  • 45
  • 57
  • This query is extremely slow for some reason. I execute it and let it sit for several minutes and it doesn't finish. I don't need extreme speed, but this is much too long. – Jacob Brunson Jun 25 '12 at 20:48
  • @DrAgonmoray Okay, I see. I will attempt at a better solution. What's your indexing structure like? Do you have an index set up on the `time` field? – Zane Bien Jun 25 '12 at 20:54
  • No I do not have an index set up on the time field, however I can add/remove indexes as the solution requires. Currently no indexes are defined for stats. – Jacob Brunson Jun 26 '12 at 19:22
  • @DrAgonmoray, could you post the output of `EXPLAIN` for my query? Also, just out of curiosity, about how many records are we dealing with in `stats`? – Zane Bien Jun 26 '12 at 19:28