This query is trying to do something that MySQL doesn't do easily: limit the number of rows per group. A list of user_ids is passed to the query, and a few columns are returned, but each group needs to be limited to 4 rows. The query works, but it is somewhat slow: 200-500 ms according to Sequel Pro.

CONTINUE READING before flagging please!!

SELECT id, user_id, article_id, row_number
FROM (
    SELECT a2.id, a2.user_id, a2.post_id,
        @num:= if(@group = a2.user_id, @num + 1, 1) as row_number
    FROM (
        SELECT a1.id, a1.user_id, a1.post_id
        FROM articles as a1
        WHERE a1.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,38,39,13,114,1111,12,223,2234,225,226,227,228,229,2210)
        ORDER BY a1.date DESC
    ) as a2, 
    (SELECT @num := 0) t
) as f
WHERE row_number <= 4;

The EXPLAIN for this query is:

id  select_type  table       type    possible_keys  key   key_len  ref   rows     Extra
1   PRIMARY      <derived2>  ALL     NULL           NULL  NULL     NULL  10516    Using where
2   DERIVED      <derived4>  system  NULL           NULL  NULL     NULL  1
2   DERIVED      <derived3>  ALL     NULL           NULL  NULL     NULL  10516
4   DERIVED      NULL        NULL    NULL           NULL  NULL     NULL  NULL     No tables used
3   DERIVED      s1          ALL     Reco...        NULL  NULL     NULL  1180931  Using filesort

I thought about breaking this up into multiple queries, but I still run into the issue of limiting each group's result to 4 rows. All in all, I'm trying to avoid both a large number of queries and expensive ones.

Any ideas on the best way to go about improving this query's speed by breaking it up and moving some of it into the application?

stwhite
  • possible duplicate of [mysql: Using LIMIT within GROUP BY to get N results per group?](http://stackoverflow.com/questions/2129693/mysql-using-limit-within-group-by-to-get-n-results-per-group) – Marc B Jul 22 '13 at 19:27
  • @MarcB you immediately flagged this, but it's not a duplicate: I'm looking for an alternative to the answer that was provided by Baron Schwartz. I'm looking to move the heavy lifting out of the query and into the application if possible. – stwhite Jul 22 '13 at 19:40
  • What are you trying to group by? Are you trying to get a list of the latest 4 articles per user? – noz Jul 22 '13 at 21:44
  • @noz: the query does not do a group by. It is filtering out the fifth, sixth, etc. consecutive, contiguous (by `date` column) rows for a single user_id (`@group`) whenever that sequence of more than four rows for that user are not "interrupted" by a row from one of the other users in the IN list. This isn't really a GROUP BY query at all. – spencer7593 Jul 22 '13 at 22:19
  • @spencer7593 Sorry, not very good phrasing – I don't mean 'group by' in the SQL sense; I probably should have said 'group on'. I am trying to clarify what is actually required out of the query, as the query given is a little odd. – noz Jul 22 '13 at 22:28
  • @noz: the query is more than a little odd, it is *VERY* odd. – spencer7593 Jul 22 '13 at 22:31
  • The query in the question does not work (article_id not defined). And the EXPLAIN relates to another query (where does s1 come from?). – noz Jul 22 '13 at 22:38
  • @noz: I think `s1` is actually `a1`, and `f.article_id` is actually `a2.post_id`. Such is the way of things when SQL gets sanitized for a question. According to the OP, "the query works". – spencer7593 Jul 22 '13 at 22:51
  • I changed a couple of things on the query before posting here, which has been causing confusion... s1 should be a1. – stwhite Jul 23 '13 at 19:16
  • Clarification: Your grouping. If for example, your user IDs 1, 2, 3 EACH have entries for Oct 1-6 (1 each day), you only want those for Oct 3-6 (last 4 entries). So, for 3 users, you would have 12 records? But you mention date groups... so if ex: User 1 has dates for 6 for Oct 1, 6 for Oct 2, 6 for Oct 3, what do you want... just the 4 from Oct 3? or do you want the last 4 from User 1 Oct 1 (4 entries), User 1, Oct 2 (4 entries), User 1 Oct 3 (4 entries). Please try to update post and sample data to be clear. – DRapp Oct 06 '14 at 16:06

3 Answers

To answer your question, I don't see any effective way to "break up" this query. You would still need to figure out whether articles from that one user_id (@group) are contiguous by date, with no intervening posts from any of the other user_ids. And having all of the rows sorted together, by date, is going to be the best way to do that.

If the rows being eliminated are a large subset of the total, filtering on the client side means transferring a much larger resultset to the client. But if only a small fraction of the rows is being filtered out, then transferring all of the rows (for all of the users in the list) to the client for processing becomes more appealing.

SELECT a.id
     , a.user_id
     , a.post_id
  FROM articles a
 WHERE a.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,...)
 ORDER BY a.date DESC

Then the client could fetch through the rows, checking for contiguous sequences of rows for that single user_id (@group), and just ignoring the fifth, sixth, etc. rows until a row with a different user_id is found.
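
A minimal sketch of that client-side loop in Python (the connection details, the target user_id, and the shortened IN list are all hypothetical; any DB-API driver with %s placeholders, such as PyMySQL, works the same way):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", db="app")
user_ids = (3, 14, 1, 2)   # hypothetical subset of the real IN list
target_user_id = 5         # hypothetical: the single user_id held in @group

placeholders = ", ".join(["%s"] * len(user_ids))
sql = ("SELECT id, user_id, post_id FROM articles "
       "WHERE user_id IN (" + placeholders + ") ORDER BY date DESC")

kept = []
run_length = 0             # length of the current contiguous run of target rows
with conn.cursor() as cursor:
    cursor.execute(sql, user_ids)
    for row_id, user_id, post_id in cursor:
        # a row from any other user_id interrupts (and resets) the run
        run_length = run_length + 1 if user_id == target_user_id else 0
        # ignore the fifth, sixth, etc. contiguous rows from the target user
        if user_id != target_user_id or run_length <= 4:
            kept.append((row_id, user_id, post_id))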

If the specifications for the resultset were different, it might be possible to break it up. But as the query is written now, the resultsets from any "broken up" queries would need to be recombined in order to produce the same resultset the current query returns.


(This query is SIGNIFICANTLY different from the query in the question marked by Marc B as a possible duplicate.)

This is an odd resultset; we don't see anywhere that @group is being assigned a value in the statement, so presumably that is being set prior to the execution of this statement. So, the expression

@group = a2.user_id

tests whether the user_id is equal to a constant. That means the query is identifying rows from articles posted by a single user_id, and incrementing row_number whenever that user posts two (or more) articles in succession, with no intervening article posted by any other user_id in the IN list (as ordered by the date column). An article posted by another user_id in the IN list resets the counter to 1.
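
If @group were never set at all, it would be NULL, the comparison would never be true, and every row would get row_number = 1 (so the outer WHERE would filter nothing). Presumably, then, something like this runs in the same session first (the value 5 is purely hypothetical):

SET @group := 5;  -- the single user_id whose contiguous runs get limited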

The net effect is that this query returns all articles from ALL users specified in the IN list EXCEPT for a single user_id (which may or may not be in the list). Whenever that one constant user_id has five or more articles posted contiguously, with no intervening articles from another user_id in the IN list, the query keeps only the first four (latest four) rows of that contiguous run.

If the date column is DATE datatype, with no time component, it's much more likely that you are going to have multiple rows with the same date. And there is no ordering specified beyond the date column, so the resultset is indeterminate. (That is, there can be multiple sequences of the same set of rows which will satisfy the ORDER BY.) It's also indeterminate with a DATETIME, but if most of those values include unique-ish time components (i.e. other than a constant such as midnight), then that is less likely an issue.

What's odd about it, is that the same set of rows could be ordered two ways, and give different results. Assuming @group identifies user 'abc':

Date       user   id        Date       user   id
---------- ------ --        ---------- ------ --
2013-07-22 abc     1        2013-07-22 abc     1
2013-07-22 abc     2        2013-07-22 abc     2
2013-07-22 abc     3        2013-07-22 abc     3
2013-07-22 EFGHI   4        2013-07-22 abc     5
2013-07-22 abc     5        2013-07-22 abc     6
2013-07-22 abc     6        2013-07-22 abc     7
2013-07-22 abc     7        2013-07-22 EFGHI   4

7 rows selected.            5 rows selected.

Both resultsets are consistent with the specification, so either could be returned.

There's nothing wrong with returning a resultset like that. It's just a bit odd.
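
One way to remove that ambiguity, assuming id is unique and roughly follows posting order, is to add a tie-breaker to the ORDER BY of the innermost query:

SELECT a1.id, a1.user_id, a1.post_id
FROM articles as a1
WHERE a1.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,...)
ORDER BY a1.date DESC, a1.id DESC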


In terms of performance, an index with a leading column of (user_id) might be suitable for the predicate in the WHERE clause, if that is eliminating a large percentage of rows.

Or, an index with leading columns of (date,user_id) might be more suitable, in that MySQL could avoid a "Using filesort" operation, and retrieve the rows in descending date order, and then filter out the rows with the predicate on user_id as the rows are accessed.

Actually, a covering index on columns (date, user_id, post_id, id) might be even more beneficial.
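
For reference, those three candidates could be created like this (the index names are hypothetical):

CREATE INDEX ix_articles_user     ON articles (user_id);
CREATE INDEX ix_articles_date     ON articles (date, user_id);
CREATE INDEX ix_articles_covering ON articles (date, user_id, post_id, id);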

spencer7593

Bit of an assumption here - if you are trying to list the latest 4 articles for each of the users in the given list, I think your query may be better as:

SET @gr = 0, @row = 0;
SELECT id, user_id, post_id, row_number
FROM (
    SELECT id, user_id, post_id,
        @row := if(user_id <> @gr, 0, @row + 1) as row_number,  -- resets to 0 when the user_id changes
        @gr := user_id
    FROM articles
    WHERE user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,38,39,13,114,1111,12,223,2234,225,226,227,228,229,2210)
    ORDER BY user_id, date DESC  -- groups each user's rows together, newest first
) as a1
WHERE row_number < 4  -- the counter starts at 0, so this keeps 4 rows per user
noz

It might be possible to do this while avoiding the use of a variable.

Join the table against itself on the user id, matching every article with the same user's articles that have a date greater than or equal to its own. Then count the matching articles, grouped by the fields you actually want, and discard rows where the count is more than 4.

Not tested, but something like this:

SELECT a1.id, a1.user_id, a1.post_id, COUNT(a1_plus.id) AS other_count
FROM articles as a1
INNER JOIN articles a1_plus
ON a1.user_id = a1_plus.user_id
AND a1.date <= a1_plus.date
WHERE a1.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,38,39,13,114,1111,12,223,2234,225,226,227,228,229,2210)
GROUP BY a1.id, a1.user_id, a1.post_id
HAVING other_count <= 4
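
One caveat, related to the duplicate-date issue raised in the answer above: if two of a user's articles share the same date, other_count is inflated for both, and rows can be dropped unexpectedly. A variant that breaks date ties on id (assuming id roughly follows posting order) might look like:

SELECT a1.id, a1.user_id, a1.post_id, COUNT(a1_plus.id) AS other_count
FROM articles as a1
INNER JOIN articles a1_plus
ON a1.user_id = a1_plus.user_id
AND (a1.date < a1_plus.date
  OR (a1.date = a1_plus.date AND a1.id <= a1_plus.id))
WHERE a1.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,...)
GROUP BY a1.id, a1.user_id, a1.post_id
HAVING other_count <= 4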
Kickstart