GROUP BY specified number of rows

Question

I have a table of data which is always sorted by date. I want to group lets say 10 000 LAST rows by 1 000. If I have lets say 15200 rows, 10 groups should consist of rows 5201-6200, 6201-7200, ... ,13201-14200 , 14201-15200. Rows in a table are not numerated. I need this to find sum of values in each of 10 groups

SELECT SUM(quantity)
FROM dataTable
GROUP BY ???

Consider making a smaller, simpler request here, providing data to suit, and then scaling up from there. — Strawberry, Jul 09 '14 at 14:38
As mentioned in this [post](http://stackoverflow.com/questions/2129693/mysql-using-limit-within-group-by-to-get-n-results-per-group), you can check out this [article](http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/), which introduces an elegant but naïve solution to the "Top N per group" problem. — Bla..., Jul 09 '14 at 14:41
"I have a table of data which is always sorted by date". So you *always* execute an `ALTER TABLE datatable ORDER BY thedate` after an insert into the table? Even if you really do so, I would never rely on this implicit order. This is an option built mainly to speed up the sorting, but you should still use order by to guarantee correct sorting. (And I actually wonder how they actually speed up sorting by having the table pre-sorted. Do they use a sorting algorithm that's extremely fast on pre-sorted data?) — Thorsten Kettner, Jul 10 '14 at 11:45

spencer7593 · Accepted Answer · 2014-07-09T15:07:35.920

You can use an inline view and user-defined variables to emulate a ROW_NUMBER analytic function, and then "group by" an expression that uses ROW_NUMBER to derive a common value for each "group" of rows.

Something like this, for example:

SELECT SUM(v.quantity) AS sum_quantity
  FROM (
         SELECT d.quantity
              , @rn := @rn + 1 AS rn
           FROM dataTable d
          CROSS
           JOIN (SELECT @rn := -1) i
          ORDER BY d.quantity
          LIMIT 10000
       ) v
 GROUP BY v.rn DIV 1000

The inline view (assigned an alias of v) assigns a "row number" to each row from dataTable. (The inline view aliased as i initializes @rn, a MySQL user-defined variable. We don't really care what that query returns, except that we need the query to return exactly one row because of the JOIN operation; what we really care about is that initializes the user-defined variable, before the rest of the statement runs.

As each row is processed, @rn is incremented by 1, and the current value is returned as a "row number" column (assigned an alias of rn) in the resultset.

(NOTE: we chose to intialize @rn to -1 rather than 0, so that the rn column will be returned with values starting at 0. This will be more convenient than starting rn values at 1, when we later "group" the rows into groups of 1000 using an integer division operation.)

The ORDER BY clause is optional; but if there's a particular "order" by which you determine which 10,000 rows are the "last" row, you'd specify that in the ORDER BY clause. (I specified the quantity column here, because that's the only column I know about in the table... I don't have any information about what identifies whether a row is "before" or "after" some other row.)

The LIMIT clause limits the number of rows returned, so we only get 10,000 rows. This will be the "first" 10,000 rows returned from the query, whatever is specified. (To get the rows in reverse order, so the "last" rows are returned first, add the DESC keyword to the ORDER BY clause.)

The GROUP BY on the outer query uses "integer division" operator. With that expression, rn values 0 thru 999 evaluate to 0, rn values 1000 thru 1999 evaluate to 1, etc.

You can run just the inline view query, to see how the row number is being assigned.

You can add additional expressions to the SELECT list in the outer query, to demonstrate how the statement is working, e.g.

SELECT SUM(v.quantity)    AS sum_quantity
     , SUM(1)             AS row_count
     , MIN(v.rn)          AS rn_min
     , MAX(v.rn)          AS rn_max
     , MAX(v.rn) DIV 1000 AS rn_div_1000

I may be wrong, but I think this query could fail. It works in the MySQL Version I tried, but still: You ask the dbms to cross join dataTable with row numbers. You want the result sorted and cut to 10000. Obviously MySQL sorts first and only then applies the cross join, which is fine but not guaranteed afaik (and actually even not how one would expect the query to work). If in a future version MySQL would apply the cross join first, you would end up with row numbers in random order, thus grouping incorrectly. — Thorsten Kettner, Jul 10 '14 at 10:12
@Thorsten: The `JOIN` is to an inline view that returns exactly one row. There's no problem with the `JOIN`. The "row numbers" don't come from the `JOIN` at all. (That could be removed, and the SELECT statement could be preceded with a separate `SET` statement to initialize the user-defined variable. — spencer7593, Jul 10 '14 at 16:19
@ThorstenKettner: The "row numbers" are generated by an expression in the SELECT list. The way MySQL processes this, it first orders the rows to be returned, and then it processes the expressions in the SELECT list. You are right to point out that that this behavior could change in a future release of MySQL. If MySQL were to process the expressions in the `SELECT` list before it processed the `ORDER BY`, you are correct; the rows would not be guaranteed to be in "row number" sequence. I added the `ORDER BY` clause as a demonstration of where to put it; OP can supply an appropriate expression. — spencer7593, Jul 10 '14 at 16:26
@ThorstenKettner: the approach of the statement in your answer isn't really any different. Sure, you've got an extra inline view with an `ORDER BY`, and then an outer query that assigns the row numbers. A future change in behavior that affects the results returned by my query may also also affect the results returned by the query in your answer. — spencer7593, Jul 10 '14 at 16:33
You are right, I didn't even see this. Well, I select the 10000 rows first and only then cross join with row numbers, but still, cross joins don't happen in any specific order. I rely on the cross join to stick to the sort order just used, which is not guaranteed to be the case. You are right; I'll update my answer. — Thorsten Kettner, Jul 11 '14 at 06:31

Thorsten Kettner · Answer 2 · 2014-07-11T06:56:16.543

First: How to get the last 10000 rows? Sort by date and cut after 10000st row.

select quantity
from datatable
order by thedate desc
limit 10000;

Then how to create row numbers which you can build groups with? Cross join the result with an incrementing variable. (Here I must rely on the fact that MySQL increments the variable after sorting the rows. This is no guaranteed behavior. In fact, I haven't found anything on this in the docs. Maybe they will gurantantee this one day, because many people already rely on this behavior.)

select @rownum := @rownum + 1 as rn, q.quantity
from
(
  select quantity, thedate
  from datatable
  order by thedate desc
  limit 10000
) q
cross join (select @rownum := 0) r
order by thedate desc;

At last build groups, by dividing the row number by 1000:

select truncate( (rn-1) / 1000, 0 ) as groupno, sum(quantity)
from
(
  select @rownum := @rownum + 1 as rn, q.quantity
  from
  (
    select quantity, thedate
    from datatable
    order by thedate desc
    limit 10000
  ) q
  cross join (select @rownum := 0) r
  order by thedate desc
)
group by groupno;

GROUP BY specified number of rows

2 Answers2