To answer your question, I don't see any effective way to "break up" this query. You would still need to figure out whether articles from that one user_id (@group)are contiguous by date, with no intervening posts from one of the other user_ids. And having all the rows sorted together, by date, is going to be the best way to do that.
If the number of rows that are being eliminated is a large subset of the rows, then filtering these on the client side would require sending a larger resultset to the client. But if it's a small fraction of rows being filtered out, then that makes the transfer of all the rows (for all of the users in the list) to the client for processing more appealing.
SELECT a.id
, a.user_id
, a.post_id
FROM articles a
WHERE a.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,...)
ORDER BY a.date DESC
Then the client could fetch through the rows, checking for contiguous sequences of rows for that single user_id (@group), and just ignoring the fifth, sixth, etc. rows until a row with a different user_id is found.
If the specifications for the resultset were different, it might be possible to break it up. But the way the query is written now, the resultsets from any "broken up" queries would need to be combined, in order to get the same resultset currently returned by the current query.
(This query is SIGNIFICANTLY different from the query in the question marked by Marc B as a possible duplicate.)
This is an odd resultset; we don't see anywhere that @group
is being assigned a value in the statement, so presumably that is being set prior to the execution of this statement. So, the expression
@group = a2.user_id
tests whether the user_id
is equal to a constant. That means that the query is identifying rows from articles
that are posted by a single user_id, and incrementing the row_number whenever that user posts two (or more) articles in succession, with no intervening article posted by any other user_id in the IN
list (as order by the DATE column). An article posted by another user_id (in the IN list), will reset the counter to 1.
The net effect is that this query is returning all articles from ALL users specified in the IN list EXCEPT for a single user_id (which may or may not be in the list.) Whenever there are five or more articles posted contiguously by that one single constant user_id, with no intervening articles from another user_id in the IN list... whenever that happens, the query keeps only the first four (latest four) rows of contiguous articles from that one specified user_id.
If the date
column is DATE datatype, with no time component, it's much more likely that you are going to have multiple rows with the same date. And there is no ordering specified beyond the date
column, so the resultset is indeterminate. (That is, there can be multiple sequences of the same set of rows which will satisfy the ORDER BY.) It's also indeterminate with a DATETIME, but if most of those values include unique-ish time components (i.e. other than a constant such as midnight), then that is less likely an issue.
What's odd about it, is that the same set of rows could be ordered two ways, and give different results. Assuming @group identifies user 'abc':
Date user id Date user id
---------- ------ -- ---------- ------ --
2103-07-22 abc 1 2103-07-22 abc 1
2103-07-22 abc 2 2103-07-22 abc 2
2103-07-22 abc 3 2103-07-22 abc 3
2103-07-22 EFGHI 4 2103-07-22 abc 5
2103-07-22 abc 5 2103-07-22 abc 6
2103-07-22 abc 6 2103-07-22 abc 7
2103-07-22 abc 7 2103-07-22 EFGHI 4
7 rows selected. 5 rows selected.
Both resultsets are consistent with the specification, so either could be returned.
There's nothing wrong with returning a resultset like that. It's just a bit odd.
In terms of performance, an index with a leading column of (user_id)
might be suitable for the predicate in the WHERE clause, if that is eliminating a large percentage of rows.
Or, an index with leading columns of (date,user_id)
might be more suitable, in that MySQL could avoid a "Using filesort" operation, and retrieve the rows in descending date order, and then filter out the rows with the predicate on user_id as the rows are accessed.
Actually, a covering index on columns (date, user_id, post_id, id)
might be even more beneficial.