My app has an Events table with time-stamped events.

I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week" or "each day" or "each hour" or "each 15-minute interval".

For example, a user can display how many orders they received each week, day, hour, or quarter-hour.

1) My preference is to dynamically do a single SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?

2) An easy but ugly brute-force way is to run a single query for all records within the start/end timeframe, sorted by timestamp, then have a method manually build a tally for whatever interval.

3) Another approach would be to add separate fields to the event table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour fields, so I take the 'hit' once, at the time the record is created, instead of every time I report on that field (a sketch of what I mean is below).
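
Roughly, I'm picturing something like this (a sketch only; the table and column names are illustrative, and generated columns need Postgres 12+, otherwise a trigger would do the same job):

ALTER TABLE events
  ADD COLUMN the_week timestamp GENERATED ALWAYS AS (date_trunc('week', ts)) STORED
, ADD COLUMN the_day  timestamp GENERATED ALWAYS AS (date_trunc('day',  ts)) STORED
, ADD COLUMN the_hour timestamp GENERATED ALWAYS AS (date_trunc('hour', ts)) STORED
  -- date_trunc() has no quarter-hour unit, so compute it explicitly:
, ADD COLUMN the_quarter_hour timestamp GENERATED ALWAYS AS
    (date_trunc('hour', ts) + interval '15 min' * floor(date_part('minute', ts) / 15)) STORED;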

What's best practice here, given I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?

  • a rough idea here but SELECT Count(Event) From Table Where 2 > Datediff(mm,Date, currentDate) > 1 – VoronoiPotato Mar 22 '13 at 18:04
  • I don't see a datediff for postgres, nor do I see how to do N < function() < M – jpw Mar 22 '13 at 18:09
  • 2
    check out these 2 posts, i 'think' this is what you are looking for? - http://stackoverflow.com/questions/12623358/group-by-data-intervals AND http://stackoverflow.com/questions/12045600/postgresql-sql-group-by-time-interval-with-arbitrary-accuracy-down-to-milli-sec – house9 Mar 22 '13 at 18:57
  • A little cursory reading, the subtraction of two dates yields an interval in Postgresql. – VoronoiPotato Mar 22 '13 at 20:55

1 Answer

Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.
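
If you are not familiar with it, here is a standalone taste (values made up):

-- returns three rows: 00:00, 00:15, 00:30
SELECT generate_series(timestamp '2018-05-01 00:00'
                     , timestamp '2018-05-01 00:30'
                     , interval '15 min') AS slot;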

Test case

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);

-- one event roughly every 7 minutes across one week
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';

One event for every 7 minutes (plus 0 to 7 minutes, randomly).
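
Optional sanity check on the test data:

-- expect 1441 rows (one per 7-minute step, both bounds included)
SELECT count(*) AS events, min(ts), max(ts) FROM event;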

Basic solution

This query counts events for any arbitrary time interval. 17 minutes in the example:

WITH grid AS (  -- one row per time slot, with its start and end
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events  -- count(e.ts) ignores NULLs, so empty slots show 0
FROM   grid       g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;

The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

Provide any time interval as needed.

Produces one row for every time slot. If no event happened during that interval, the count is 0.

Be sure to handle upper and lower bounds correctly.

The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. The example provides 'infinity'. Otherwise the last interval would be cut off, with an upper bound of NULL.
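
A minimal standalone demonstration of that third parameter (values made up):

-- without the default, next_x would be NULL for the last row
SELECT x, lead(x, 1, timestamp 'infinity') OVER (ORDER BY x) AS next_x
FROM  (VALUES (timestamp '2018-05-01')
            , (timestamp '2018-05-02')) t(x);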

Minimal equivalent

The above query uses a CTE, lead(), and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

SELECT start_time, count(e.ts) AS events
FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY 1
ORDER  BY 1;
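
Since the grid bounds need not come from the table itself, the same pattern wraps neatly in a function taking an arbitrary range and step. A hypothetical sketch (the name and signature are just an example):

CREATE OR REPLACE FUNCTION event_counts(_start timestamp, _end timestamp, _step interval)
  RETURNS TABLE (start_time timestamp, events bigint)
  LANGUAGE sql STABLE AS
$func$
SELECT g.start_time, count(e.ts)  -- count(e.ts) keeps empty slots at 0
FROM   generate_series(_start, _end, _step) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + _step
GROUP  BY 1
ORDER  BY 1;
$func$;

Call it like:

SELECT * FROM event_counts(timestamp '2018-05-01', timestamp '2018-05-08', interval '15 min');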

Example for "every 15 minutes in the past week"

Formatted with to_char().

SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                     , localtimestamp
                     , interval '15 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '15 min'
GROUP  BY start_time
ORDER  BY start_time;

Note that ORDER BY and GROUP BY still operate on the underlying timestamp value, not on the formatted string. That's faster and more reliable.


  • I'm still trying to grok this, but it appears another brilliant aspect of this approach is that it handles the "zero count" issue... e.g., if there is no data in an interval it reports 0 rather than omitting the interval (as other approaches do). – jpw Mar 22 '13 at 18:57
  • 13
    Seriously, in the history of SO that's the Most. Amazing. Answer. Ever. – jpw Mar 22 '13 at 20:52
  • @ErwinBrandstetter Thank you for provided code, it almost perfectly fits in my case, except that I have FK field group_id by which it should be grouped. As example: `events of group_id = 1 occurred 20 times from now() - 7 days to now(); events of group_id = 2 occurred 1 times from now() - 7 days to now();` Can you help me with this? From what I've tried - simple group by on group_id does not give me the right queryset. – Dmitrijs Zubriks May 25 '16 at 00:34
  • @ErwinBrandstetter do you know how to filter these by a rails scope? – Jason Axelson Dec 29 '16 at 21:24
  • @ErwinBrandstetter I find this query is quite slow with millions of records. Is there any way to improve the performance? Which indexes would be most efficient here? – coolboyjules May 07 '18 at 18:45
  • @coolboyjules: I updated various details in this old answer. Indexes depend on your exact use case, but aggregating a whole table with millions of rows cannot be very fast - and index support is limited for this scenario. You might start a new question, comments are not the place. You can always link to this one for context. – Erwin Brandstetter May 08 '18 at 00:31
  • @ErwinBrandstetter thanks for your edit. It is really appreciated and I will give your approach a try without the CTE. I've posted my question here https://stackoverflow.com/questions/50221842/slow-left-join-on-cte-in-postgresql – coolboyjules May 08 '18 at 01:31
  • Just for others reference, here is the same `generate_series` logic abstracted using a function. https://gist.github.com/gajus/b10235149427c1c746d94d86c1befc5d – Gajus May 17 '18 at 23:43