12

I have a table (let's call it Data) with a set of object IDs, numeric values and dates. I would like to identify the objects whose values had a positive trend over the last X minutes (say, an hour).

Example data:

entity_id | value | date

1234      | 15    | 2014-01-02 11:30:00

5689      | 21    | 2014-01-02 11:31:00

1234      | 16    | 2014-01-02 11:31:00

I tried looking at similar questions, but didnt find anything that helps unfortunately...

Tim Zimmermann
  • 6,132
  • 3
  • 30
  • 36
Dan Markhasin
  • 752
  • 2
  • 8
  • 20
  • I assume by positive trend you mean whose value increased over the last hour? – BenM Jan 02 '14 at 12:43
  • Yes, that is correct. This is basically temperature data gathered from various sensors, and I'm trying to detect sensors where the temperature is steadily rising... – Dan Markhasin Jan 02 '14 at 12:46
  • 1
    Also, would it be a suitable solution to simply look at the difference between the first and last value? Or do you want to do some sort of regression over time? 'Cause the former is achievable, the latter is probably not... – John Chrysostom Jan 02 '14 at 13:04
  • 1
    you could use LAG funxtions – Randy Jan 02 '14 at 13:07

2 Answers2

39

You inspired me to go and implement linear regression in SQL Server. This could be modified for MySQL/Oracle/Whatever without too much trouble. It's the mathematically best way of determining the trend over the hour for each entity_id and it will select out only the ones with a positive trend.

It implements the formula for calculating B1hat listed here: https://en.wikipedia.org/wiki/Regression_analysis#Linear_regression

create table #temp
(
    entity_id int,
    value int,
    [date] datetime
)

insert into #temp (entity_id, value, [date])
values
(1,10,'20140102 07:00:00 AM'),
(1,20,'20140102 07:15:00 AM'),
(1,30,'20140102 07:30:00 AM'),
(2,50,'20140102 07:00:00 AM'),
(2,20,'20140102 07:47:00 AM'),
(3,40,'20140102 07:00:00 AM'),
(3,40,'20140102 07:52:00 AM')

select entity_id, 1.0*sum((x-xbar)*(y-ybar))/sum((x-xbar)*(x-xbar)) as Beta
from
(
    select entity_id,
        avg(value) over(partition by entity_id) as ybar,
        value as y,
        avg(datediff(second,'20140102 07:00:00 AM',[date])) over(partition by entity_id) as xbar,
        datediff(second,'20140102 07:00:00 AM',[date]) as x
    from #temp
    where [date]>='20140102 07:00:00 AM' and [date]<'20140102 08:00:00 AM'
) as Calcs
group by entity_id
having 1.0*sum((x-xbar)*(y-ybar))/sum((x-xbar)*(x-xbar))>0
John Chrysostom
  • 3,973
  • 1
  • 34
  • 50
  • 2
    As you might have guessed, this also allows you to sort on the Beta column in the result set to find the entities with the strongest trend of temperature increase. If you're into that sort of thing. – John Chrysostom Jan 02 '14 at 14:55
  • Curious, what is the Beta actually correlating to here (eg. velocity, acceleration)? Can it be mapped to a particular relationship between temperature and time? – Arjun Mehta Jul 24 '18 at 21:02
  • 1
    Beta is a linear best estimate of the change in the value associated with a 1-unit change in time (in this case, 1 second)... – John Chrysostom Jul 25 '18 at 15:22
  • So I re implemented this in mysql, I was wondering if this is a good way to implement "trending content in the last hour" for video plays. I have the exact same data structure where value is the number of plays a video had in that minute. – Nicola Peluchetti Mar 22 '21 at 23:31
1

If someone needs this in Mysql, this is the code that works for me.

datapoint | plays | status_time 

1234      | 15    | 2014-01-02 11:30:00

5689      | 21    | 2014-01-02 11:31:00

1234      | 16    | 2014-01-02 11:31:00

select datapoint, 1.0*sum((x-xbar)*(y-ybar))/sum((x-xbar)*(x-xbar)) as Beta
from
(
     select datapoint,
        avg(plays) over(partition by datapoint) as ybar,
        plays as y,
        avg(TIME_TO_SEC(TIMEDIFF('2021-03-22 21:00:00', status_time))) over(partition by datapoint) as xbar,
        TIME_TO_SEC(TIMEDIFF('2021-03-22 21:00:00', status_time)) as x
    from aggregate_datapoints
    where status_time BETWEEN'2021-03-22 21:00:00' and '2021-03-22 22:00:00'
and type = 'topContent') as calcs
group by datapoint
having 1.0*sum((x-xbar)*(y-ybar))/sum((x-xbar)*(x-xbar))>0
Nicola Peluchetti
  • 76,206
  • 31
  • 145
  • 192