
I have a User table with the following fields.

| id | created_at | username | 

I want to filter this table so that I can get the number of users created in a datetime range, separated into N intervals. E.g. for users with created_at between 2019-01-01T00:00:00 and 2019-01-02T00:00:00, separated into 2 intervals, I would get something like this:

_______________________________
|       dt            | count |
-------------------------------
| 2019-01-01T00:00:00 |   6   |
| 2019-01-01T12:00:00 |   7   |
-------------------------------

Is it possible to do this in one hit, i.e. a single query? I am currently using the Django ORM to create N date ranges and then making N queries, which isn't very efficient.
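For reference, the current N-query approach might look roughly like this in Python (a hypothetical sketch; `split_range` and the ORM call are illustrative assumptions, not the actual code):

```python
from datetime import datetime

def split_range(start, end, n):
    """Split [start, end) into n equal half-open sub-intervals."""
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]

# Two 12-hour intervals for 2019-01-01 -> 2019-01-02
bounds = split_range(datetime(2019, 1, 1), datetime(2019, 1, 2), 2)

# Each (lo, hi) pair then drives one ORM query, e.g.:
# User.objects.filter(created_at__gte=lo, created_at__lt=hi).count()
```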

Mehran

2 Answers


Generate the times you want and then use left join and aggregation:

select gs.ts, count(u.id)
from generate_series('2019-01-01T00:00:00'::timestamp,
                     '2019-01-01T12:00:00'::timestamp,
                     interval '12 hour'
                    ) gs(ts) left join
     users u
     on u.created_at >= gs.ts and
        u.created_at < gs.ts + interval '12 hour'
group by 1
order by 1;

EDIT:

If you want to specify the number of rows, you can use something similar:

select v.ts, count(u.id)
from generate_series(1, 10, 1) as gs(n) cross join lateral
     (values ('2019-01-01T00:00:00'::timestamp + (gs.n - 1) * interval '12 hour')
     ) v(ts) left join
     users u
     on u.created_at >= v.ts and
        u.created_at < v.ts + interval '12 hour'
group by 1
order by 1;
Gordon Linoff
  • Is it possible to give number of rows needed in stead of time interval? If my end date was 2019-01-07 I would get 14 rows for 12 hours whereas I only needed two rows. – Mehran Sep 07 '20 at 13:45
  • @Mehran . . . You can generate numbers and then construct the timestamps. I edited the answer to show one method for doing this. – Gordon Linoff Sep 07 '20 at 14:56
  • yeah, I've done exactly that but not in your way. I've gotten the number of hours required to make the data range divided into the number of intervals I want. So for 4 days and 4 intervals, the hours will be 24 hours, and so on. – Mehran Sep 07 '20 at 16:08
  • is it possible to get the value 0 when the returned count is null? – Mehran Sep 10 '20 at 11:53
  • @Mehran . . . This should not return `NULL` for any values. `COUNT()` does not return `NULL` (well, it can if it is in a subquery that returns no rows, but that is not really the `COUNT()`). – Gordon Linoff Sep 10 '20 at 12:36

In Postgres, there is a dedicated function for this (several overloaded variants, really): width_bucket().

One additional difficulty: it does not work on type timestamp directly. But you can work with extracted epoch values like this:

WITH cte(min_ts, max_ts, buckets) AS (  -- interval and nr of buckets here
   SELECT timestamp '2019-01-01T00:00:00'
        , timestamp '2019-01-02T00:00:00'
        , 2
   )
SELECT width_bucket(extract(epoch FROM t.created_at)
                  , extract(epoch FROM c.min_ts)
                  , extract(epoch FROM c.max_ts)
                  , c.buckets) AS bucket
     , count(*) AS ct
FROM   tbl t
JOIN   cte c ON t.created_at >= min_ts  -- incl. lower
            AND t.created_at <  max_ts  -- excl. upper
GROUP  BY 1
ORDER  BY 1;

Empty buckets (intervals with no rows in them) are not returned at all. Your comment under the other answer seems to suggest you want those as well.

Notably, this accesses the table only once - as requested, and as opposed to generating intervals first and then joining to the table repeatedly.
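If you do want empty buckets reported with a count of 0, one simple option is to fill the gaps client-side after this single-scan query; a minimal Python sketch, assuming the query returns sparse (bucket, ct) rows:

```python
def fill_empty_buckets(rows, n_buckets):
    """Expand sparse (bucket, count) pairs to all buckets 1..n_buckets, defaulting to 0."""
    counts = dict(rows)
    return [(b, counts.get(b, 0)) for b in range(1, n_buckets + 1)]

# buckets 2 and 4 had no rows, so they were missing from the query result
filled = fill_empty_buckets([(1, 6), (3, 2)], 4)
```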


That does not yet include effective bounds, just bucket numbers. Actual bounds can be added cheaply:

WITH cte(min_ts, max_ts, buckets) AS (  -- interval and nr of buckets here
   SELECT timestamp '2019-01-01T00:00:00'
        , timestamp '2019-01-02T00:00:00'
        , 2
   )
SELECT b.*
     , min_ts + ((c.max_ts - c.min_ts) / c.buckets) * (bucket-1) AS lower_bound
FROM  (
   SELECT width_bucket(extract(epoch FROM t.created_at)
                     , extract(epoch FROM c.min_ts)
                     , extract(epoch FROM c.max_ts)
                     , c.buckets) AS bucket
        , count(*) AS ct
   FROM   tbl t
   JOIN   cte c ON t.created_at >= min_ts  -- incl. lower
               AND t.created_at <  max_ts  -- excl. upper
   GROUP  BY 1
   ORDER  BY 1
   ) b, cte c;

Now you only need to change the input values in the CTE to adjust the results.
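As a sanity check, the bucket arithmetic that width_bucket() applies to the extracted epoch values can be mirrored in Python (a sketch of the documented behavior for operands inside [min, max): equal-width buckets, numbered from 1):

```python
from datetime import datetime, timezone

def width_bucket(operand, low, high, count):
    """Equal-width bucket number (1-based) for low <= operand < high,
    mirroring Postgres width_bucket() for in-range operands."""
    return int((operand - low) / ((high - low) / count)) + 1

# Epoch seconds, as produced by extract(epoch FROM ...); UTC assumed here.
min_ts = datetime(2019, 1, 1, tzinfo=timezone.utc).timestamp()
max_ts = datetime(2019, 1, 2, tzinfo=timezone.utc).timestamp()

early = datetime(2019, 1, 1, 5, tzinfo=timezone.utc).timestamp()   # first half of the day
late = datetime(2019, 1, 1, 18, tzinfo=timezone.utc).timestamp()   # second half of the day
```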


Erwin Brandstetter