This is another approach.
This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats
table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, AVG(t.votes) AS avg_votes
FROM ( SELECT CASE WHEN u.server = @last_server
THEN @i := @i + 1
ELSE @i := 1
END AS i
, @last_server := u.server AS `server`
, u.votes AS votes
FROM (SELECT @i := 0, @last_server := NULL) i
JOIN ( SELECT v.server, v.votes
FROM stats v
ORDER BY v.server DESC, v.time DESC
) u
) t
WHERE t.i <= 24
GROUP BY t.server
) s
ON s.server = r.id
What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u
.)
With the sorted result set, we assign a row numbers 1,2,3, etc. to each row for each server. (Inline view aliased as t
.)
With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes
column for the "latest" 24 rows for each server. (Inline view aliased as s
.)
As a final step, we join that to the servers table, to return the requested resultset.
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats
table.
To improve performance, there are several approaches we could take.
The simplest might to be include in the query a predicate the EXCLUDES a significant number of rows from the stats
table (e.g. rows with time
values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted, to determine the "latest" 24 rows.
Also, with an index on stats(server,time)
, it's also possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.
We could also consider implementing an index on the stats table on (server,"reverse_time")
. Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on an a derived rtime
value (a "reverse time" expression that is ascending for descending values of time
(for example, -1*UNIX_TIMESTAMP(my_timestamp)
or -1*TIMESTAMPDIFF('1970-01-01',my_datetime)
.
Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest rows" won't be deleted from the stats
table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats
table, we check if the time
on the new rows is later than the earliest time
stored for the server in the shadow table, if it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.
And, yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, and run a separate query against the stats table to get the average votes
for the latest 24 rows, and gather all of those results together. (That approach mighty really be more of a workaround to avoiding a sort on huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)
The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats
ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC
, and MySQL always wanted to do a filesort, even specifying the FORCE INDEX FOR ORDER BY (stats_ix1)
hint.
If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes
is NOT NULL, or if we exclude any rows where votes
is NULL.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, t.tot/NULLIF(t.cnt,0) AS avg_votes
FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
, @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
, @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0) ,@tot := 0),@tot := IFNULL(v.votes,0) ) AS tot
, @last_server := v.server AS SERVER
-- , v.time
-- , v.votes
-- , @tot/NULLIF(@cnt,0) AS avg_sofar
FROM (SELECT @last_server := NULL, @num:= 0, @cnt := 0, @tot := 0) u
JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
ORDER BY v.server DESC, v.time DESC
) t
WHERE t.num = 24
) s
ON s.server = r.id
With a covering index on stats(server,time,votes)
, the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, and index on '(server,time), MySQL used the index if I included an index hint, with the
FORCE INDEX FOR ORDER BY (stats_ix1)` hint, MySQL avoided a filesort as well. (But since my table had less than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t
); they aren't needed, but they are for debugging.
The way that query stands, it needs at least 24 rows in stats for each server, in order to return an average. (That may be acceptable.) But I was thinking that in general, we could return a running total, total so far (tot) and a running count (cnt).
(If we replace the WHERE t.num = 24
with WHERE t.num <= 24
, we can see the running average in action.)
To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.