
I am trying to optimize a MySQL query that updates a column of a table with a 15-minute moving average of an item's price, computed every quarter hour for each store.

My table has the following structure:

╔═════╦═════════════════════╦════════════╦══════╦════════════════╗
║ ID  ║      DATETIME       ║   NAME     ║Price ║ 15_MIN_AVERAGE ║
╠═════╬═════════════════════╬════════════╬══════╬════════════════╣
║ 1   ║ 2000-01-01 00:00:05 ║ WALMART    ║   1  ║                ║
║ 2   ║ 2000-01-01 00:00:05 ║ BESTBUY    ║   6  ║                ║
║ 3   ║ 2000-01-01 00:00:05 ║ RADIOSHACK ║   2  ║                ║
║ 4   ║ 2000-01-01 00:00:10 ║ WALMART    ║   6  ║                ║
║ 5   ║ 2000-01-01 00:00:10 ║ BESTBUY    ║   2  ║                ║   
║ 6   ║ 2000-01-01 00:00:10 ║ RADIOSHACK ║   8  ║                ║
║ 7   ║ 2000-01-01 00:00:15 ║ WALMART    ║  10  ║                ║
║ 8   ║ 2000-01-01 00:00:15 ║ BESTBUY    ║   2  ║                ║
║ 9   ║ 2000-01-01 00:00:15 ║ RADIOSHACK ║   3  ║                ║
║ 10  ║ 2000-01-01 00:00:20 ║ WALMART    ║   6  ║                ║
║ 11  ║ 2000-01-01 00:00:20 ║ BESTBUY    ║   4  ║                ║
║ 12  ║ 2000-01-01 00:00:20 ║ RADIOSHACK ║   5  ║                ║
║ 13  ║ 2000-01-01 00:00:25 ║ WALMART    ║   1  ║                ║
║ 14  ║ 2000-01-01 00:00:25 ║ BESTBUY    ║   0  ║                ║
║ 15  ║ 2000-01-01 00:00:25 ║ RADIOSHACK ║   5  ║                ║
║ 16  ║ 2000-01-01 00:00:30 ║ WALMART    ║   1  ║                ║
║ 17  ║ 2000-01-01 00:00:30 ║ BESTBUY    ║   6  ║                ║
║ 18  ║ 2000-01-01 00:00:30 ║ RADIOSHACK ║   2  ║                ║
║ 19  ║ 2000-01-01 00:00:35 ║ WALMART    ║   6  ║                ║
║ 20  ║ 2000-01-01 00:00:35 ║ BESTBUY    ║   2  ║                ║
║ 21  ║ 2000-01-01 00:00:35 ║ RADIOSHACK ║   8  ║                ║
║ 22  ║ 2000-01-01 00:00:40 ║ WALMART    ║  10  ║                ║
║ 23  ║ 2000-01-01 00:00:40 ║ BESTBUY    ║   2  ║                ║
║ 24  ║ 2000-01-01 00:00:40 ║ RADIOSHACK ║   3  ║                ║
║ 25  ║ 2000-01-01 00:00:45 ║ WALMART    ║   6  ║                ║
║ 26  ║ 2000-01-01 00:00:45 ║ BESTBUY    ║   4  ║                ║
║ 27  ║ 2000-01-01 00:00:45 ║ RADIOSHACK ║   5  ║                ║
║ 28  ║ 2000-01-01 00:00:48 ║ WALMART    ║   1  ║                ║
║ 29  ║ 2000-01-01 00:00:48 ║ BESTBUY    ║   0  ║                ║
║ 30  ║ 2000-01-01 00:00:48 ║ RADIOSHACK ║   5  ║                ║
║ 31  ║ 2000-01-01 00:00:50 ║ WALMART    ║   6  ║                ║
║ 32  ║ 2000-01-01 00:00:50 ║ BESTBUY    ║   4  ║                ║
║ 33  ║ 2000-01-01 00:00:50 ║ RADIOSHACK ║   5  ║                ║
║ 34  ║ 2000-01-01 00:00:55 ║ WALMART    ║   1  ║                ║
║ 35  ║ 2000-01-01 00:00:55 ║ BESTBUY    ║   0  ║                ║
║ 36  ║ 2000-01-01 00:00:55 ║ RADIOSHACK ║   5  ║                ║
║ 37  ║ 2000-01-01 00:01:00 ║ WALMART    ║   1  ║                ║
║ 38  ║ 2000-01-01 00:01:00 ║ BESTBUY    ║   0  ║                ║
║ 39  ║ 2000-01-01 00:01:00 ║ RADIOSHACK ║   5  ║                ║
╚═════╩═════════════════════╩════════════╩══════╩════════════════╝

My query is:

UPDATE my_table AS t
INNER JOIN
( SELECT ID,
    (SELECT AVG(price) FROM my_table AS t2
     WHERE
        t2.`datetime` BETWEEN SUBTIME(t1.`datetime`, '00:14:59') AND t1.`datetime` AND
        t2.name = t1.name
    ) AS average
FROM my_table AS t1
WHERE
    MINUTE(`datetime`) IN (0,15,30,45) ) AS sel
ON t.ID = sel.ID
SET t.`15_MIN_AVERAGE` = sel.average;

I have an index on the DATETIME column (which is of type DATETIME), but I think using functions such as MINUTE() and SUBTIME() in a WHERE clause basically nullifies the index.

My table has around 1.6 million records (about one record every 5 minutes). Currently, it takes a very long time to run this query (over an hour), which is unacceptable.

What do you recommend to optimize it?

Thank you very much!

Vincent L

3 Answers


I think it is better to create a range table for this. Here is a nice example:

generate days from date range

A table like this covers 10 years * 365 days * 24 hours * 4 quarter-hours ≈ 350k rows, but the index will work perfectly.

So your table should look like:

  id    start                 end
  1     2016-11-10 10:00:00   2016-11-10 10:04:59
  2     2016-11-10 10:05:00   2016-11-10 10:09:59
  3     2016-11-10 10:10:00   2016-11-10 10:14:59

And your query will assign an id to each datetime:

 SELECT t.name, r.id, AVG(t.price)
 FROM my_table t
 JOIN `range` r
   ON t.`DATETIME` BETWEEN r.start
                       AND r.end
 GROUP BY t.name, r.id

Alternative, with no gap between consecutive ranges:

  id    start                 end
  1     2016-11-10 10:00:00   2016-11-10 10:05:00
  2     2016-11-10 10:05:00   2016-11-10 10:10:00
  3     2016-11-10 10:10:00   2016-11-10 10:15:00


 SELECT t.name, r.id, AVG(t.price)
 FROM my_table t
 JOIN `range` r
   ON t.`DATETIME` >= r.start AND t.`DATETIME` < r.end
 GROUP BY t.name, r.id
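
To populate the alternative (half-open) version, the same digits trick from the linked question can be used. A minimal sketch for one day of slots, assuming id is AUTO_INCREMENT; the table name is backticked because RANGE became a reserved word in MySQL 8.0:

 INSERT INTO `range` (start, end)
 SELECT TIMESTAMP('2016-11-10') + INTERVAL 15 * (x.a + 10 * y.b) MINUTE,
        TIMESTAMP('2016-11-10') + INTERVAL 15 * (x.a + 10 * y.b) + 15 MINUTE
 FROM (SELECT 0 AS a UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL
       SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
       SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
       SELECT 9) x
 CROSS JOIN (SELECT 0 AS b UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL
             SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
             SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
             SELECT 9) y
 WHERE x.a + 10 * y.b < 96;  -- 96 fifteen-minute slots per day

Cross join further digit tables to cover additional days.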
Juan Carlos Oropeza
  • Those sample ranges all have a one minute gap from one to the next. The end point of one range should equal the start point of the next range; then **DO NOT use BETWEEN** in the joins, use >= and < instead. This way there are no gaps or overlaps. – Paul Maxwell Nov 25 '16 at 04:40
  • @Used_By_Already I understand what you say. But I don't see which datetime would fall into a gap or overlap; can you show me an example? I prefer this setup because it allows me to use `BETWEEN` – Juan Carlos Oropeza Nov 25 '16 at 04:51
  • Please see the alternative in your answer. In the alternative there is no one-second gap (sorry, I meant 1 second earlier) and also no overlap caused by using "between" (BETWEEN INCLUDES both end points, >= and <=); e.g. ref: https://sqlblog.org/2011/10/19/what-do-between-and-the-devil-have-in-common. Do note that MySQL now supports time units smaller than 1 second – Paul Maxwell Nov 25 '16 at 05:09
  • @Used_By_Already I told you I understand your alternative. And that is a problem only if you use ms. But I will check Bertrand's blog later. And I don't really mind the update, but if you don't like my answer you should post that alternative as a separate answer. – Juan Carlos Oropeza Nov 25 '16 at 05:18
  • But I do like the concept; it's just the gap of 1 second in the date range that leads to problems (if, for instance, someone with sub-second precision uses your answer). Please feel free to remove my material from your answer; I wasn't trying to take it over, I just needed the formatting really (+ was using a phone at the time) – Paul Maxwell Nov 25 '16 at 07:32
  • The latter alternative is better because it is impervious to `DATE` vs `DATETIME` vs `DATETIME(6)` (microsecond resolution). – Rick James Nov 28 '16 at 01:41

This is a variant of the range proposal by Juan Carlos Oropeza. I suspect that actually storing just the 15-minute averages in their own table makes sense, but here I have applied it as requested. Note, however, that I cannot bring myself to call a column a reserved word like "datetime", so I have used "pricedatetime" instead.

There is an inherent assumption that you do not need more than 1000 fifteen-minute intervals; if you do, then you need to adjust the number of cross joins etc. to expand the Cartesian product to something larger.

Also, assuming this is only required when new data is added, the logic will re-process all rows for the dates where the stored average is null.

update table1
inner join (
    select 
           dr.start_date
         , dr.end_date
         , avg(t.price) avg_price
    from table1 t
    inner join (
          SELECT
                  (x.a + (y.b*10)+(z.c*100))+ 1 n  -- interval number, 1 to 1000
                , TRIM(min_date + INTERVAL 15*(x.a + (y.b*10)+(z.c*100)) MINUTE) start_date  -- window start (inclusive)
                , TRIM(min_date + INTERVAL 15*(x.a + (y.b*10)+(z.c*100)) MINUTE) + INTERVAL 15 MINUTE end_date  -- window end (exclusive)
          FROM (
                select 
                       cast(date(min(pricedatetime)) as datetime) min_date
                     , cast(date(max(pricedatetime)) as datetime) max_date
                from Table1 
                where 15_MIN_AVERAGE IS NULL
               ) m
          CROSS JOIN (  -- digits 0-9; the three cross joins yield interval numbers 0..999
                    SELECT 0 AS a UNION ALL
                    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL  
                    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL 
                    SELECT 9
               ) x
          CROSS JOIN (
                    SELECT 0 AS b UNION ALL
                    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL  
                    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL 
                    SELECT 9
               ) y
          CROSS JOIN (
                    SELECT 0 AS c UNION ALL
                    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL  
                    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL 
                    SELECT 9
               ) z
          where TRIM(min_date + INTERVAL 15*((x.a + (y.b*10)+(z.c*100))-1) MINUTE) < max_date  -- stop generating windows past the newest data
        ) dr on t.pricedatetime >= dr.start_date and t.pricedatetime <  dr.end_date
    group by
           dr.start_date
         , dr.end_date
    ) g on table1.pricedatetime >= g.start_date and table1.pricedatetime < g.end_date
set `15_MIN_AVERAGE` = g.avg_price
;

Please note that I have very deliberately avoided the use of BETWEEN. BETWEEN is NOT a good option for date ranges, as it includes both the lower and upper boundaries, and as a consequence it is possible for rows to be double-counted. Instead, simply use a combination of >= and < and that problem disappears totally. Also note that with this approach it does not matter whether the pricedatetime column is precise to a second or sub-second; if you avoid using BETWEEN, the ranges will remain accurate.

The above suggestion is available as a working demo at: http://sqlfiddle.com/#!9/299150/1
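
As a quick runnable illustration of the double-counting (the timestamps are hypothetical):

SELECT '2016-11-10 10:05:00' BETWEEN '2016-11-10 10:00:00' AND '2016-11-10 10:05:00' AS in_first,
       '2016-11-10 10:05:00' BETWEEN '2016-11-10 10:05:00' AND '2016-11-10 10:10:00' AS in_second;
-- both columns return 1: a value on the shared boundary satisfies two
-- BETWEEN ranges, whereas >= start AND < end assigns it to exactly one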

Paul Maxwell

Plan A: Upgrade to MariaDB 10.2 and use the "Windowing functions" to do such a "moving average".
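
For illustration, a minimal sketch of what that could look like; it assumes the server accepts INTERVAL bounds in RANGE frames (MySQL 8.0 does; early MariaDB 10.2 releases may be more limited, so verify):

SELECT ID, `DATETIME`, NAME, Price,
       AVG(Price) OVER (
           PARTITION BY NAME          -- one moving average per store
           ORDER BY `DATETIME`
           RANGE BETWEEN INTERVAL 15 MINUTE PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM my_table;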

Plan B: Every 15 seconds look back 15 minutes in the table and compute all the averages for the current 3 rows. Store them (via INSERT, not UPDATE) into a separate table. You should never need to recompute them. By having an index on datetime, you don't need to look at more than 180 rows to do the computations. This will take much less than the 15 seconds you have before you need to compute the next set of averages.

Don't have an id on the new table, nor on the old table. You have a perfectly good 'natural' primary key in (name, datetime). You can JOIN the "summary table" to the original table if you need both the price and the average.
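
A minimal sketch of Plan B; the table and column names here are illustrative placeholders, not part of the original schema:

CREATE TABLE avg_15min (
    name         VARCHAR(20)   NOT NULL,
    window_start DATETIME      NOT NULL,
    avg_price    DECIMAL(10,4) NOT NULL,
    PRIMARY KEY (name, window_start)  -- natural key, no surrogate id
);

-- Run on a schedule; @win is the start of the window that just closed.
SET @win := '2000-01-01 00:00:00';
INSERT INTO avg_15min (name, window_start, avg_price)
SELECT name, @win, AVG(price)
FROM my_table
WHERE `DATETIME` >= @win
  AND `DATETIME` <  @win + INTERVAL 15 MINUTE
GROUP BY name;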

Plan C: Switch to an "exponential moving average"; it is much simpler to compute: The new average is

old_average + 0.1 * (new_value - old_average)

Pick a value smaller than 0.1 if you want the average to smooth things out more, or a larger value to make it respond faster.
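
A sketch of computing this with a user variable (the 0.1 factor and the single-store filter are assumptions, and the technique relies on row evaluation order, which MySQL does not strictly guarantee):

SET @ema := NULL;
SELECT d.`DATETIME`, d.price,
       -- seed with the first price, then apply the update rule
       @ema := IF(@ema IS NULL, d.price,
                  @ema + 0.1 * (d.price - @ema)) AS ema
FROM (SELECT `DATETIME`, price
      FROM my_table
      WHERE name = 'WALMART'
      ORDER BY `DATETIME`) AS d;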

Rick James