
I have a MySQL DB with accumulated data from sensors, and I want to calculate the maximum value for a sensor within intervals.

Here is data fragment:

"id",  "dateCreate",           "temp"
"2",   "2017-08-19 03:19:08",  "24.63"
"3",   "2017-08-19 03:24:11",  "24.65"
"14",  "2017-08-19 03:29:15",  "24.63"
"15",  "2017-08-19 03:38:49",  "24.63"
"26",  "2017-08-19 03:43:53",  "24.56"
"37",  "2017-08-19 03:48:55",  "24.60"
"38",  "2017-08-19 03:53:59",  "24.65"
"49",  "2017-08-19 03:59:01",  "24.70"
"50",  "2017-08-19 04:04:04",  "24.75"
"51",  "2017-08-19 04:09:07",  "24.70"
"62",  "2017-08-19 04:14:07",  "24.70"

I want to group such data into chunks of N rows (N may vary) and run MAX (or MIN) over each chunk.

The expected result if I group the sample data into 3 rows would be:

"id",  "dateCreate",           "temp"
"3",   "2017-08-19 03:24:11",  "24.65"
"15",  "2017-08-19 03:38:49",  "24.63"
"50",  "2017-08-19 04:04:04",  "24.75"
"51",  "2017-08-19 04:09:07",  "24.70"

I spent several hours searching for a solution and unfortunately wasn't able to find any working approach.

I imagine the solution may be based on row numbering and/or limiting rows (with an offset) within sub-queries, but the test queries I managed to create looked awful, didn't work as expected, and were too slow.
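
To make the idea concrete, here is a rough sketch of that row-numbering approach using window functions (this assumes a MySQL version with window function support, i.e. 8.0+, which my server may not have; the chunk size is hard-coded to 3 for illustration):

-- Sketch only: number rows by dateCreate, bucket them into chunks of 3,
-- and keep the row with the highest temp in each chunk.
SELECT id, dateCreate, temp
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY chunk_no ORDER BY temp DESC) AS pos_in_chunk
    FROM (
        SELECT m.*,
               (ROW_NUMBER() OVER (ORDER BY dateCreate) - 1) DIV 3 AS chunk_no
        FROM MultiTemp AS m
    ) AS c
) AS r
WHERE pos_in_chunk = 1
ORDER BY dateCreate;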

Here is an example which gets maximum value within one interval:

SELECT a.*
     FROM (
           SELECT *
             FROM MultiTemp
            ORDER BY dateCreate DESC
            LIMIT 100 -- OFFSET is needed
          ) AS a
 INNER JOIN (
   SELECT MAX(temp) AS temp
     FROM (
           SELECT *
             FROM MultiTemp
            ORDER BY dateCreate DESC
            LIMIT 100 -- OFFSET is needed
          ) AS L
 ) AS M
 ON a.temp = M.temp

Further data processing is done in PHP (to generate graphs), so the above generic query may be equipped with an OFFSET for the other intervals. Unfortunately, it is too slow...
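
For instance, fetching the second chunk of 100 rows would only change the two inner sub-queries (an illustration; the PHP code would substitute the actual offset per chunk):

SELECT *
  FROM MultiTemp
 ORDER BY dateCreate DESC
 LIMIT 100 OFFSET 100  -- rows 101-200; the offset grows by 100 for each further chunk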

So far the question is mainly focused on grouping: selecting the other data fields may be implemented with approaches like the one proposed here.

I believe such a task (grouping into chunks of fixed size) is much easier for the engine than grouping by some column, so I've probably overlooked something...

Update 1: the column id contains an auto-incremented value, but there is no guarantee it is continuous - the table holds data from different devices, and I want to group data for one device (I had left this aspect out of scope for brevity). I've adjusted the sample data accordingly. Sorry for the confusion!

AntonK
  • You could try making a virtual field for grouping: SELECT *, CONCAT(DATE(dateCreate), HOUR(dateCreate), FLOOR(MINUTE(dateCreate)/15)) as timeslot [...] GROUP BY timeslot. This should give you groups of 15 minutes: 0 = 0-14, 1 = 15-29, 2 = 30-44, 3 = 45-59. – jar3d Nov 03 '17 at 07:42
  • The SQL language does not provide such functionality. Your processing assumes an order of rows. SQL works with sets that are, by their very definition, collections of items and nothing more; there is no order in a set or any relationship between elements of a set that implies an order of them. Because the lack of any order, grouping the records in fixed-sized groups produces indeterminate results. – axiac Nov 03 '17 at 07:42
  • What if there’s a tie for first place? – Strawberry Nov 03 '17 at 07:51
  • @axiac, we can enforce the order with `ORDER BY` to avoid indeterminate results... and unfortunately I can't use `id` since it is not continuous (I've adjusted the sample data). – AntonK Nov 03 '17 at 08:37
  • @jar3d, the calculated field makes it all run extremely slow :( I was also experimenting with grouping by `DATE_FORMAT(dateCreate, '%Y-%m-%d %H:00:00')` and it turned out to be too slow. – AntonK Nov 03 '17 at 08:49
  • Regarding `ORDER BY`, there is a thin line. In theory, `ORDER BY` doesn't help for this problem, no matter how you use it. In practice, for speed, the engine uses an index to find the rows to process and the index provides the rows in the order you desire. It is not because `ORDER BY` tells it so, it is just a side effect of the implementation. Remember that `ORDER BY` does not tell in what order to process the input rows (and there is no rule about this), it tells in what order to put the output rows. – axiac Nov 03 '17 at 09:11
  • `DATE_FORMAT()` is slow, as is any function or operation that works with strings. `GROUP BY UNIX_TIMESTAMP(dateCreated) DIV (15*60)` is slightly faster because it processes numbers. – axiac Nov 03 '17 at 09:14
  • @axiac, thanks for the hint about `UNIX_TIMESTAMP()` - with it the query runs ~3 times faster compared to `DATE_FORMAT()` (a sketch of this approach is shown right below). – AntonK Nov 03 '17 at 09:40
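
A sketch of the time-bucket grouping discussed in the comments above (15-minute buckets via `UNIX_TIMESTAMP`; note this groups by time intervals rather than by a fixed number of rows, and the alias names are only illustrative):

SELECT UNIX_TIMESTAMP(dateCreate) DIV (15*60) AS timeslot,  -- 15-minute bucket number
       MIN(dateCreate) AS slotStart,
       MAX(temp) AS maxTemp
  FROM MultiTemp
 GROUP BY timeslot
 ORDER BY timeslot;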

1 Answer


Let's try this:

SET @chunk = 3;
SET @serial = @chunk - 1;
SELECT *
FROM (
    SELECT (@serial := @serial+1) DIV @chunk AS `serial_number`,
           `MultiTemp`.*
    FROM `MultiTemp`
    ORDER BY `serial_number`, `temp` DESC
) x
GROUP BY `serial_number`

You can of course remove the temporary field serial_number from the final result.
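
If only the maximum per chunk is needed, the grouping can also be written with an explicit aggregate (a sketch only; the rows are numbered in whatever order the engine reads them, so an explicit ordering of the source rows may still be required):

SET @chunk = 3;
SET @serial = @chunk - 1;
SELECT serial_number, MAX(temp) AS maxTemp
FROM (
    -- assign a chunk number: the first @chunk rows get 1, the next @chunk rows get 2, ...
    SELECT (@serial := @serial + 1) DIV @chunk AS serial_number,
           temp
    FROM MultiTemp
) x
GROUP BY serial_number;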

To make it faster, you can try limiting the processed rows:

SET @chunk = 3;
SET @serial = @chunk - 1;
SELECT *
FROM (
    SELECT (@serial := @serial+1) DIV @chunk AS `serial_number`,
           `y`.*
    FROM (
        SELECT *
        FROM `MultiTemp`
        LIMIT 99
    ) y
    ORDER BY `serial_number`, `temp` DESC
) x
GROUP BY `serial_number`
G.Baghashvili
  • Thanks for the code. I've managed to make it work by replacing `SELECT *` with `SELECT serial_number, MAX(temp)`. It takes ~30 seconds on 500K rows. After removing `ORDER BY` (which seems to be redundant here) it becomes ~2 times faster, which is still too slow considering the extra data I have to fetch from the same table... – AntonK Nov 03 '17 at 09:54
  • If you remove `ORDER BY` it returns an incorrect result. Let's find other ways to improve the speed of the query. – G.Baghashvili Nov 03 '17 at 10:00
  • I've created an SQL Fiddle for your code - http://sqlfiddle.com/#!9/88a52d/1 , and here is my version adapted for MySQL 5.7.20 - http://sqlfiddle.com/#!9/88a52d/2 . Both of them return the same set of maximum values from `temp`. – AntonK Nov 03 '17 at 10:25
  • There is another fiddle at https://www.db-fiddle.com/f/8gQV1G4fGzYiWz4EmffVWf/0 which supports MySQL 5.7 (it complains with the error `ER_WRONG_FIELD_WITH_GROUP: Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'x.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by`) – AntonK Nov 03 '17 at 10:41
  • The data is already ordered in these fiddles, and that's why the results are the same. But is your real table already ordered? – G.Baghashvili Nov 03 '17 at 11:36
  • You can run this code to get around that error: `mysql > SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));` – G.Baghashvili Nov 03 '17 at 11:37
  • Another fiddle with more data: https://www.db-fiddle.com/f/8gQV1G4fGzYiWz4EmffVWf/1 . The data in the real DB is usually ordered by `dateCreate`, but IMHO this isn't relevant for this task... – AntonK Nov 03 '17 at 11:53
  • Can you try it with `LIMIT`? See my edit. The query will become faster. – G.Baghashvili Nov 03 '17 at 12:12
  • Well, with `LIMIT` you're just processing a small part of the dataset... it is not a surprise it runs faster :) – AntonK Nov 03 '17 at 14:42
  • You probably don't need all 500K rows, do you? – G.Baghashvili Nov 03 '17 at 16:21