2

How can we find X consecutive dates (using by hour) that meet a condition?

EDIT: here is the SQL fiddle http://sqlfiddle.com/#!17/44928/1

Example:

Find 3 consecutive dates where aa < 2 and bb < 6 and cc < 7

Given this table called weather:

timestamp aa bb cc
01/01/2000 00:00 1 5 5
01/01/2000 01:00 5 5 5
01/01/2000 02:00 1 5 5
01/01/2000 03:00 1 5 5
01/01/2000 04:00 1 5 5
01/01/2000 05:00 1 5 5

Answer should return the 3 records from 02:00, 03:00, 04:00.

How can we do this in Ruby on Rails - or directly in SQL if that is better?

I started working on a method based on this answer: Detect consecutive dates ranges using SQL

def consecutive_dates
  the_query = "WITH t AS (
    SELECT timestamp d,ROW_NUMBER() OVER(ORDER BY timestamp) i
    FROM @d
    GROUP BY timestamp
  )
  SELECT MIN(d),MAX(d)
  FROM t
  GROUP BY DATEDIFF(hour,i,d)"

  ActiveRecord::Base.connection.execute(the_query)
end

But I was unable to get it working.

viktorsmari
  • 179
  • 1
  • 3
  • 12

2 Answers2

3

)This is a gaps-and-islands problem. Islands are adjacent records that match the condition, and you want islands that are at least 3 records long.

Here is one approach that uses a window count that increments every time value that does not match the condition is met to define the groups. We can then count how many rows there are in each group, and use that information to filter.

select  *
from (
    select t.*, count(*) over(partition by a, grp) cnt
    from (
        select t.*,
            count(*) filter(where b <= 4) over(partition by a order by timestamp) grp
        from mytable t
    ) t
) t
where cnt >= 3
GMB
  • 216,147
  • 25
  • 84
  • 135
  • Thank you, this code actually runs! I renamed the variable `a` to `aa` so it won't get mistaken with the SQL syntax. I also added a third variable, `cc` to help me understand the syntax. How would we add the third variable `cc` in? – viktorsmari Dec 12 '20 at 11:30
  • @viktorsmari: you would just change the `filter` clause of the inner window function. Note, however, that you should not be significantly changing the question after answers are posted. This invalidates the answer, which can lead to downvotes. – GMB Dec 12 '20 at 12:28
  • Yes, sorry, I just did that to make it a bit clearer, and I commented to let you know so you can change your `a`s to `aa`s. Sorry for the inconvenience. So to clarify, in order to find the `cc` - I could change the filter to: `filter(where b <= 4 AND c < 7)` ? – viktorsmari Dec 12 '20 at 12:35
  • I am just baffled by both the answer and the upvotes. This has no filtering conditions in it. – Gordon Linoff Dec 12 '20 at 13:17
3

Assuming that you have one row every hour, then an easy way to get the first hour where this occurs uses lead():

select t.*
from (select t.*,
             lead(timestamp, 2) over (order by timestamp) as timestamp_2
      from t
      where aa < 2 and bb < 6 and cc < 7
     ) t
where timestamp_2 = timestamp + interval '2 hour';

This filters on the conditions and looks at the rows two rows ahead. If it is two hours ahead, then three rows in a row match the conditions. Note: The above will return both 2020-01-01 02:00 and 2020-01-01 03:00.

From your question you only seem to want the earliest. To handle that, use lag() as well:

select t.*
from (select t.*,
             lag(timestamp) over (order by timestamp) as prev_timestamp
             lead(timestamp, 2) over (order by timestamp) as timestamp_2
      from t
      where aa < 2 and bb < 6 and cc < 7
     ) t
where timestamp_2 = timestamp + interval '2 hour' and
      (prev_timestamp is null or prev_timestamp < timestamp - interval '1' hour);

You can generate the additional hours use generate_series() if you really need the original rows:

select t.timestamp + n.n * interval '1 hour', aa, bb, cc
from (select t.*,
             lead(timestamp, 2) over (order by timestamp) as timestamp_2
      from t
      where aa < 2 and bb < 6 and cc < 7
     ) t cross join lateral
     generate_series(0, 2) n
where timestamp_2 = timestamp + interval '2 hour';

Your data seems to have precise timestamps based on the question, so the timestamp equalities will work. If the real data has more fuzziness, then the queries can be tweaked to take this into account.

mrtnrst
  • 41
  • 6
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon! I added [sql fiddle](http://sqlfiddle.com/#!9/65c3d9) if you want to test your code, I tried but got errors. Maybe the table name is missing? – viktorsmari Dec 12 '20 at 14:23
  • My bad, I had the SQL fiddle engine on MySQL - not PostgreSQL, updating [here](http://sqlfiddle.com/#!17/5c437/4) – viktorsmari Dec 12 '20 at 14:32
  • We only need the `timestamp` of the **last** consecutive point. We are looking to perform 'activities' that take for example 3 hours to perform, and we need 3 consecutive hours where the conditions (aa, bb, cc) are met. – viktorsmari Dec 12 '20 at 14:39
  • @viktorsmari . . . Your question in no way suggests that you want the *last* point, which would be 05:00 and is not in your result set. This specifically handles the case of three adjacent rows where the conditions are met. – Gordon Linoff Dec 12 '20 at 15:04
  • You are right sorry, I will update the question to make it clearer. Originally I thought we needed the 3 records, but the last consecutive timestamp is enough. Question: If we would need **6** consecutive records, would we only need to change your examples, where it says **2** to **5**? So if we use a variable `consecutive_hours - 1`? – viktorsmari Dec 12 '20 at 15:13
  • @viktorsmari . . . Do not update the question. You have two answers to the question as-is, and that would likely invalidate the answers. Ask a new question that is clear (and highlights exactly what you want so it won't be closed as a duplicate of this question). – Gordon Linoff Dec 12 '20 at 15:21
  • Ok Gordon! The question is to find `X` consecutive dates. Using your first answer, (and my previous comment), would that be a proper way of handling `X` consecutive dates? For example if we would need 6 consecutive dates instead of 3, which parts of your first example would we need to replace? – viktorsmari Dec 12 '20 at 15:44
  • Just replace the "2" with "5". – Gordon Linoff Dec 12 '20 at 15:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225880/discussion-between-viktorsmari-and-gordon-linoff). – viktorsmari Dec 12 '20 at 21:02