
I have a very large MySQL table containing data read from a number of sensors. Essentially, there's a timestamp and a value column. I'll omit the sensor id, indexes, and other details here:

CREATE TABLE `data` (
  `time` datetime NOT NULL,
  `value` float NOT NULL
)

The value column rarely changes, and I need to find the points in time when those changes occur. Assuming there's a value every minute, the following query returns exactly what I need:

SELECT d.*, 
  (SELECT value FROM data WHERE time<d.time ORDER by time DESC limit 1) 
    AS previous_value 
FROM data d 
HAVING d.value<>previous_value OR previous_value IS NULL;

+---------------------+-------+----------------+
| time                | value | previous_value |
+---------------------+-------+----------------+
| 2011-05-23 16:05:00 |     1 |           NULL |
| 2011-05-23 16:09:00 |     2 |              1 |
| 2011-05-23 16:11:00 |   2.5 |              2 |
+---------------------+-------+----------------+

The only problem is that this is very inefficient, mostly due to the dependent subquery. What would be the best way to optimize this using the tools that MySQL 5.1 has to offer?

One last constraint is that the values are not ordered before they are inserted into the data table and that they might be updated at a later point. This might affect any possible de-normalization strategies.
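For reference, the behaviour of the query above can be reproduced with a small script. This is just an illustration using Python's built-in SQLite in place of MySQL, with the `HAVING` filter moved into an outer query (the `HAVING`-without-`GROUP BY` form is a MySQL extension); the sample rows are invented to match the output table above:

```python
import sqlite3

# In-memory stand-in for the MySQL table; sample data invented to
# match the result table shown in the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (time TEXT NOT NULL, value REAL NOT NULL)")
con.executemany("INSERT INTO data VALUES (?, ?)", [
    ("2011-05-23 16:05:00", 1),
    ("2011-05-23 16:06:00", 1),
    ("2011-05-23 16:07:00", 1),
    ("2011-05-23 16:08:00", 1),
    ("2011-05-23 16:09:00", 2),
    ("2011-05-23 16:10:00", 2),
    ("2011-05-23 16:11:00", 2.5),
])

# Same dependent subquery as in the question; the HAVING filter
# becomes a WHERE on an outer query for standard-SQL engines.
changes = con.execute("""
    SELECT * FROM (
        SELECT d.*,
               (SELECT value FROM data WHERE time < d.time
                ORDER BY time DESC LIMIT 1) AS previous_value
        FROM data d
    )
    WHERE value <> previous_value OR previous_value IS NULL
""").fetchall()
for row in changes:
    print(row)
# prints the three change points from the question's result table
```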

cg.
  • What indexes are there in the table? – ypercubeᵀᴹ May 24 '11 at 11:55
  • Sidenote: It's a bad habit to have tables or fields named `time` or `datetime` or `date` or `float`, etc. – ypercubeᵀᴹ May 24 '11 at 11:57
  • @ypercube: Aside from keys on columns not shown in this example (like a synthetic primary key), there's a unique key on the time column. – cg. May 24 '11 at 12:03
  • @ypercube: Thanks for the hint, but this is not my actual code. I shortened and condensed it to make my question more concise. I might have overdone it with respect to the time column. :) – cg. May 24 '11 at 12:05

2 Answers


I suppose it's not an option for you to switch DB engines. In case it is, window functions would allow you to write things like this:

SELECT d.*
FROM (
    SELECT d.*, lag(d.value) OVER (ORDER BY d.time) as previous_value 
    FROM data d
  ) as d
WHERE d.value IS DISTINCT FROM d.previous_value;
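As an illustration (not part of the original answer), the `LAG()` approach can be sanity-checked with SQLite 3.25+, which ships with recent Python versions; note that SQLite spells the null-safe comparison `IS NOT` rather than `IS DISTINCT FROM`:

```python
import sqlite3

# SQLite stands in for an engine with window functions (MySQL only
# gained LAG() in 8.0). Sample data invented: value changes at
# 16:05 (first row), 16:09, and 16:11.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (time TEXT NOT NULL, value REAL NOT NULL)")
con.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("2011-05-23 16:%02d:00" % m, v)
     for m, v in zip(range(5, 12), [1, 1, 1, 1, 2, 2, 2.5])],
)

# LAG(value) pulls the previous row's value in time order;
# "IS NOT" keeps rows where the two differ, NULL-safely.
changes = con.execute("""
    SELECT time, value, previous_value FROM (
        SELECT d.*, LAG(value) OVER (ORDER BY time) AS previous_value
        FROM data d
    )
    WHERE value IS NOT previous_value
""").fetchall()
for row in changes:
    print(row)
```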

If not, you could try to rewrite the query like so:

select data.*
from data
left join (
    select data.measure_id,
           data.time,
           max(prev_data.time) as prev_time
    from data
    left join data as prev_data
    on prev_data.time < data.time
    group by data.measure_id, data.time, data.value
    ) as prev_data_time
on prev_data_time.measure_id = data.measure_id
and prev_data_time.time = data.time
left join data as prev_data_value
on prev_data_value.measure_id = data.measure_id
and prev_data_value.time = prev_data_time.prev_time
where data.value <> prev_data_value.value or prev_data_value.value is null;
Denis de Bernardy
  • @Denis, note that `group by` already orders the elements listed in it, so the last `order by ..` is not needed. – Johan May 24 '11 at 12:26
  • True, but that ordering is an implementation side-effect, not part of the SQL standard. You never know when MySQL will drop the side-effect (Oracle did). :-) – Denis de Bernardy May 24 '11 at 12:29
  • You can also experiment with index on `(value,time)` or `(sensor_id,value,time)` and see the query plan using this index. – ypercubeᵀᴹ May 24 '11 at 12:35
  • @Denis, Thanks a lot for your time! Could you please explain the column measure_id in your example? Is that supposed to be the primary key for the data table or a foreign key? – cg. May 24 '11 at 12:35
  • @cg: the primary key of the data table. – Denis de Bernardy May 24 '11 at 12:37
  • @Denis: Ok I just create the data table using your column names and inserted a few rows for testing purposes. The sequence of test values ordered by time is (3,2,1,1,1,2,2.5,2.5,2). When I execute your query, I'm getting all the rows and not just those that mark a value change. Looking at the query, I don't quite see why it should work. Maybe I'm missing some critical point... – cg. May 24 '11 at 13:27
  • The new one should work as expected, and the join on the aggregate should make it marginally faster than the current one. :-| – Denis de Bernardy May 24 '11 at 13:57
  • Should the line "left join prev_data_value" actually read "left join data prev_data_value"? – youcantryreachingme Mar 09 '17 at 04:39

You might try this - I'm not going to guarantee that it will perform better, but it's my usual way of correlating a row with a "previous" row:

SELECT
    * -- TODO: list columns
FROM
    data d
       left join
    data d_prev
       on
           d_prev.time < d.time -- TODO: other key columns?
       left join
    data d_inter
       on
           d_inter.time < d.time and
           d_prev.time < d_inter.time -- TODO: other key columns?
WHERE
    d_inter.time is null AND
    (d_prev.value is null OR d_prev.value <> d.value)

(I think this is right - could do with some sample data to validate it).

Basically, the idea is to join the table to itself, and for each row (in d), find candidate rows (in d_prev) for the "previous" row. Then do a further join, to try to find a row (in d_inter) that exists between the current row (in d) and the candidate row (in d_prev). If we cannot find such a row (d_inter.time is null), then that candidate was indeed the previous row.
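As a sanity check (an illustration, not from the original answer), the triple self-join can be run against the test sequence (3, 2, 1, 1, 1, 2, 2.5, 2.5, 2) mentioned in the comments, one value per minute, with SQLite standing in for MySQL since only plain joins are involved:

```python
import sqlite3

# Test sequence from the question's comments; the change points are
# at positions 0, 1, 2, 5, 6, and 8 (six rows expected).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (time TEXT NOT NULL, value REAL NOT NULL)")
values = [3, 2, 1, 1, 1, 2, 2.5, 2.5, 2]
con.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("2011-05-23 16:%02d:00" % m, v) for m, v in enumerate(values)],
)

# d_prev: candidate earlier rows; d_inter: any row strictly between
# the pair. d_inter.time IS NULL keeps only the immediate predecessor
# (or no predecessor at all, for the first row).
changes = con.execute("""
    SELECT d.time, d.value, d_prev.value
    FROM data d
    LEFT JOIN data d_prev
           ON d_prev.time < d.time
    LEFT JOIN data d_inter
           ON d_inter.time < d.time
          AND d_prev.time < d_inter.time
    WHERE d_inter.time IS NULL
      AND (d_prev.value IS NULL OR d_prev.value <> d.value)
    ORDER BY d.time
""").fetchall()
for row in changes:
    print(row)
```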

Damien_The_Unbeliever
  • Great! This is actually the kind of "trick" that I was looking for. Your query is orders of magnitude faster than the original one. It's still not fast enough to be used directly, but it could be the basis for the data aggregation I need. Thank you very much for your answer. – cg. May 24 '11 at 14:48
  • I'll vote it up now and accept it in a few days if no better solution comes up. – cg. May 24 '11 at 14:54
  • I think you may also technically need OR d.value is null in that last bracketed statement of the WHERE clause. – user1383092 Apr 22 '16 at 10:01
  • @user1383092 - from the question - `value float NOT NULL`. We only end up generating `NULL`s in columns from the right-hand side of `LEFT JOIN`s. But `d` is on the left-hand side of those joins. Therefore, its `value` can never be `NULL`. – Damien_The_Unbeliever Apr 22 '16 at 10:11