3

This is a tough one I've been fighting with for a bit. I have a table called action_events that stores event_name and timestamp:

CREATE TABLE action_events (
  id SERIAL PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL,
  event_name VARCHAR(255) NOT NULL
);

I want to grab the most recent "streak" for an event with a given name.

A "streak" is the number of consecutive days that the event has occurred at least once. An event may occur more than once a day. Big gotcha: The streak should also take into account a given timezone.

Using this fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/7828

Given a query for "exercise" in "MST" timezone, I'd expect the streak to be:

Streak Count Name TimeZone Start Date End Date
13 "exercise" "MST" 2023-02-18 09:00:00 2023-02-30 09:00:00

Even though the streak ended a month ago, it should still show as the most recent streak.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Rockster160
  • 1,579
  • 1
  • 15
  • 30

1 Answers1

2

Plain timestamp data is ignorant of time zones. A given time zone is meaningless while we don't know the time zone of the timestamp data.
I'll assume your timestamps are supposed to represent UTC. Should really be timestamptz to avoid ambiguity and an additional transformation.

For few rows per event

Plain SQL. The query won't get much snappier than this:

SELECT sum(ct)       AS streak_count
     , 'exercise'    AS event_name
     , 'MST'         AS timezone
     , min(min_ts)   AS start_date
     , max(max_ts)   AS end_date
FROM  (
   SELECT *, the_day - row_number() OVER (ORDER BY the_day)::int AS streak
   FROM  (
      SELECT (timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'MST')::date AS the_day
           , count(*) AS ct
           , min(timestamp) AS min_ts
           , max(timestamp) AS max_ts           
      FROM   action_events
      WHERE  event_name = 'exercise'
      GROUP  BY 1
      ) sub1
   ) sub2
GROUP  BY streak
ORDER  BY end_date DESC
LIMIT  1;

fiddle

Step-by-step

sub1

Filter only the event_name of interest right away - event_name = 'exercise' in the example.
timestamp AT TIME ZONE 'UTC' produces timestamptz.
... AT TIME ZONE 'MST' then returns the corresponding timestamp at 'MST'. See:

Cast to ::date (the_day), and group by that.
So we get (at most) one row per day and carry along count, min, and max.

sub2

To identify streaks, simply subtract the row number (type integer!) from the_day (type date!). See:

This produces the same (otherwise meaningless) day for consecutive days.
(sub1 and sub2 could be merged, but that'd be unwieldy.)

Outer SELECT

Aggregate per streak, take sum of counts, min of min, and max of max. Sort by min (or max, same result) descending and take the first row (LIMIT 1).
Voilá.

In the result, start_date and end_date represent original timestamps (in UTC). You might want to show timestamps at 'MST' instead. You didn't say.

No minimum streak length has been defined, so it could be just a single row.

But since this processes all rows for the given event, it does not scale well for many rows.

Many rows per event

It will be (much) faster to start with the latest row and loop until a gap is encountered. While no minimum streak length is required, we can't go wrong.

Demonstrating a PL/pgSQL function:

CREATE OR REPLACE FUNCTION f_latest_streak(
   INOUT name        text
 , INOUT timezone    text
 , OUT   steak_count int
 , OUT   start_date  timestamp
 , OUT   end_date    timestamp)
  LANGUAGE plpgsql STRICT PARALLEL SAFE STABLE AS
$func$
DECLARE
   _day_start   timestamp;
   _start_date  timestamp;
   _streak_step int;
BEGIN
   -- get end & UTC time for start of latest day at given time zone
   SELECT max(a.timestamp)
        , date_trunc('day', max(a.timestamp) AT TIME ZONE 'UTC' AT TIME ZONE timezone)
          AT TIME ZONE 'UTC' AT TIME ZONE 'UTC'  -- sic!
   INTO   end_date, _day_start
   FROM   action_events a
   WHERE  event_name = name;
   
   IF NOT FOUND THEN  -- no rows at all
      RETURN;
   END IF;
   
   -- get count for first day
   SELECT count(*)::int, min(timestamp)
   INTO   steak_count, start_date
   FROM   action_events a
   WHERE  a.event_name = name  -- careful with naming conflicts!
   AND    a.timestamp >= _day_start;

   -- more days?
   LOOP
      SELECT count(*)::int, min(timestamp)
      INTO   _streak_step, _start_date
      FROM   action_events a
      WHERE  a.event_name = name  -- careful with naming conflicts!
      AND    a.timestamp >= _day_start - interval '1 day'
      AND    a.timestamp <  _day_start;
      
      IF _streak_step = 0 THEN  -- streak ends here
         RETURN;
      ELSE
         steak_count := steak_count + _streak_step;         
         start_date  := _start_date;
         _day_start  := _day_start - interval '1 day';
      END IF;
   END LOOP;
END
$func$;

Call:

SELECT * FROM f_latest_streak('exercise', 'MST');

A multicolumn index on (event_name, timestamp) will make the function fast.

You should be comfortable with PL/pgSQL to play with this.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Will the query planner deal with the `ORDER BY` that is a `max` of a `max` to scan into the table by `timestamp`, or will it scan the whole table? I would have tried to identify the most recent streak with something like `SELECT * FROM sub1 WHERE the_day >= (SELECT the_day FROM sub1 WHERE lead(the_day) <> the_day - 1 ORDER BY the_day DESC LIMIT 1)` (ie. find the most recent gap, then select everything after that) – Bergi Mar 23 '23 at 01:25
  • @Bergi: The pure SQL solution won't scan the whole table if the filter on `event_name` is selective enough to engage an index on the same column. But that still won't scale well for many rows per event. (I expect your alternative approach to processes all rows for the event, too). I added a performant solution for many rows per event. – Erwin Brandstetter Mar 23 '23 at 03:33