How to create an efficient query which will count of the records by a specific time interval?

Question

Which database I use?

I use a PostgreSQL 9.5.

What I need?

This's a part of my data_store tables:

  id |          starttime
-----+----------------------------
 185 | 2011-09-12 15:24:03.248+02
 189 | 2011-09-12 15:24:03.256+02    
 312 | 2011-09-12 15:24:06.112+02
 313 | 2011-09-12 15:24:06.119+02
 450 | 2011-09-12 15:24:09.196+02
 451 | 2011-09-12 15:24:09.203+02
 452 | 2011-09-12 15:24:09.21+02
 ... |            ...

I would like to create the query which will count of the records by the specific time interval. For example, for the 4 seconds time interval - the query should return to me something like this:

    starttime-from   |    starttime-to     |  count
---------------------+---------------------+---------
 2011-09-12 15:24:03 | 2011-09-12 15:24:07 |    4
 2011-09-12 15:24:07 | 2011-09-12 15:24:11 |    3
 2011-09-12 15:24:11 | 2011-09-12 15:24:15 |    0
         ...         |         ...         |   ...

The most important things:

The time interval depends on the user's choice. It could be 1 second, 37 seconds, 50 minutes or some mix: 2 month and 30 mintues. The available units for the time interval: millisecond, second, minute, hour, day, month, year. How you see, I need some generic/universal query for that BUT I could also create several query for each unit - it isn't a problem.
The query should be efficient, because I work in a large database (20 million rows and more but in query I use only a part of this database, for example: 1 million).

The question is: How should the query look like to achieve that?

I tried to convert the solutions which I found in the following threads, but I didn't succeed:

What I have?

I deleted this section of my post for greater transparency of the post. This section wasn't necessary to give an answer my question. If you want to see what here was, look at the history of the post.

Gordon Linoff · Accepted Answer · 2016-08-06T14:33:12.297

2

Your query seems complicated. You only need to generate the sequence of times and then use left join to bring them together . . . and aggregate:

select g.ts,  g.ts + interval '4 second', count(ds.id)
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
      from data_store
     ) g left join
     data_store ds
     on ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
group by g.ts
order by g.ts;

Note: If you want the interval to begin on an exact second (and not have some strange number of milliseconds 999 times out of 1000), then use date_trunc().

EDIT:

It might be worth seeing if a correlated subquery is faster:

select gs.ts,
       (select count(*)
        from data_store ds
        where ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
       ) as cnt
from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
      from data_store
     ) g;

edited Aug 06 '16 at 14:33

answered Aug 05 '16 at 18:04

Gordon Linoff

1,242,037
58
646
786

I was sure that my query is too complicated. Your query is proving that. Thanks for your help, it works exactly like I wanted. Of course I'll accept this answer, but before that could you tell if it's possible to make your query more efficient? If I use some large interval like `60 minutes`, it works even fast but for small interval like `4 seconds` - after 30 minutes of waiting, I gave up. I know that it depends on a lot of things. First of all I've created the B-Tree index on the `starttime` column but it doesn't help. Do you have any idea, how can I increase of speed your query? – Robert Aug 05 '16 at 23:16
Robert . . . If the above edit doesn't work, ask another question about performance -- include the query that gets the right answer but doesn't perform well. There may be other approaches, but I think they confuse the answer to this question. – Gordon Linoff Aug 06 '16 at 14:34
The second query is slower than the first, so I'll stay with the first query. – Robert Aug 07 '16 at 13:51
By the way: based on your first query and my query, finally I've found the faster query. I've just posted an answer with this query, so if someone wants to try it, look [here](http://stackoverflow.com/a/38818684/3616467) – Robert Aug 07 '16 at 21:40

score 1 · Answer 2 · answered Aug 05 '16 at 18:42

If it helps, I use a UDF to create dynamic date/time ranges.

Use the results in a Join on SomeDate>=DateR1 and SomeDate

The Range, DatePart, and Increment are parameters

Declare @Date1 DateTime = '2011-09-12 15:24:03 '
Declare @Date2 DateTime = '2011-09-12 15:30:00 '
Declare @DatePart varchar(25)='SS'
Declare @Incr int=3


Select DateR1 = RetVal
    ,DateR2 = LEAD(RetVal,1,@Date2) OVER (ORDER BY RetVal)
From (Select * from [dbo].[udf-Create-Range-Date](@Date1,@Date2,@DatePart,@Incr) ) A
Where RetVal<@Date2

Returns

DateR1                  DateR2
2011-09-12 15:24:03.000 2011-09-12 15:24:06.000
2011-09-12 15:24:06.000 2011-09-12 15:24:09.000
2011-09-12 15:24:09.000 2011-09-12 15:24:12.000
2011-09-12 15:24:12.000 2011-09-12 15:24:15.000
2011-09-12 15:24:15.000 2011-09-12 15:24:18.000
2011-09-12 15:24:18.000 2011-09-12 15:24:21.000
...
2011-09-12 15:29:48.000 2011-09-12 15:29:51.000
2011-09-12 15:29:51.000 2011-09-12 15:29:54.000
2011-09-12 15:29:54.000 2011-09-12 15:29:57.000
2011-09-12 15:29:57.000 2011-09-12 15:30:00.000

The UDF

CREATE FUNCTION [dbo].[udf-Create-Range-Date] (@DateFrom datetime,@DateTo datetime,@DatePart varchar(10),@Incr int)

Returns 
@ReturnVal Table (RetVal datetime)

As
Begin
    With DateTable As (
        Select DateFrom = @DateFrom
        Union All
        Select Case @DatePart
               When 'YY' then DateAdd(YY, @Incr, df.dateFrom)
               When 'QQ' then DateAdd(QQ, @Incr, df.dateFrom)
               When 'MM' then DateAdd(MM, @Incr, df.dateFrom)
               When 'WK' then DateAdd(WK, @Incr, df.dateFrom)
               When 'DD' then DateAdd(DD, @Incr, df.dateFrom)
               When 'HH' then DateAdd(HH, @Incr, df.dateFrom)
               When 'MI' then DateAdd(MI, @Incr, df.dateFrom)
               When 'SS' then DateAdd(SS, @Incr, df.dateFrom)
               End
        From DateTable DF
        Where DF.DateFrom < @DateTo
    )

    Insert into @ReturnVal(RetVal) Select DateFrom From DateTable option (maxrecursion 32767)

    Return
End

-- Syntax Select * from [dbo].[udf-Create-Range-Date]('2016-10-01','2020-10-01','YY',1) 
-- Syntax Select * from [dbo].[udf-Create-Range-Date]('2016-10-01','2020-10-01','DD',1) 
-- Syntax Select * from [dbo].[udf-Create-Range-Date]('2016-10-01','2016-10-31','MI',15) 
-- Syntax Select * from [dbo].[udf-Create-Range-Date]('2016-10-01','2016-10-02','SS',1)

Thanks for your help. Maybe I'll try that. – Robert Aug 06 '16 at 06:47 — Robert, Aug 06 '16 at 06:47

Robert · Answer 3 · 2016-08-19T13:03:16.277

Improvement of the query in selected answer.

I've just improved the query which you can find in the selected answer.

The final query is as follows:

SELECT gp.tp AS starttime_from, gp.tp + interval '4 second' AS starttime_to, count(ds.id)
FROM (SELECT generate_series(min(starttime),max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138'
      ORDER BY 1
     ) gp 
     LEFT JOIN data_store ds 
     ON ds.id_user_table=1 and ds.sip='147.32.84.138' 
        and ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
GROUP BY starttime_from

I've moved ORDER BY to the subquery. Now it's a little faster. I've also added requried columns in the WHERE clause. Finally, I've created the multicolumns index on the columns which I always use in the query:

CREATE INDEX my_index ON data_store (id_user_table, sip, starttime);

At the moment the query is very fast. Note that: for the very small time intervals, the result of the query includes a lot of zero count rows. These rows eat up space. In this case the query should include the HAVING count(ds.id) > 0 restriction but then you've to handle of these 0 on the client side.

The another solution

This solution isn't as fast as the previous but the below query isn't use the multicolumns index and it's still fast.

Two important things in the query which you can find at the end of this answer:

The 'second' is precision to which to truncate the input value. You can also choose other precision like: millisecond,minute,day, etc.
The '4 second' is time interval. The time interval can have other units like millisecond, minute, day, etc.

Here you can find explanation of the query:

The generate_period query generates the intervals which starting from the specified datetime to the specific datetime. You can indicate this specific datetime manually or by means of your column of the table (like in my case). For 4 seconds interval time interval, the query returns:
```
          tp
---------------------
 2011-09-12 15:24:03
 2011-09-12 15:24:07
 2011-09-12 15:24:11
         ...
```
The data_series query counts the records for specific precision of the datetime: for 1 second time interval, for 1 day time interval, etc. In my case the specific precision is 'second', so for 1 second time interval but the result of the select operation doesn't include the 0 value for the datetime which doesn't occur. In my case, the data_series query returns:
```
       starttime     |    ct
---------------------+-----------
 2011-09-12 15:24:03 |     2
 2011-09-12 15:24:06 |     2
 2011-09-12 15:24:09 |     3     
         ...         |    ...
```

Finally, the last part of the query sum up the ct column for the specific periods of time. The query returns this:

    starttime-from   |    starttime-to     |   ct
---------------------+---------------------+---------
 2011-09-12 15:24:03 | 2011-09-12 15:24:07 |    4
 2011-09-12 15:24:07 | 2011-09-12 15:24:11 |    3
 2011-09-12 15:24:11 | 2011-09-12 15:24:15 |    0
         ...         |         ...         |   ...

Here's the query:

WITH generate_period AS(

    SELECT generate_series(date_trunc('second',min(starttime)), 
                           date_trunc('second',max(starttime)), 
                           interval '4 second') as tp
    FROM data_store 
    WHERE id_user_table=1 --other restrictions

), data_series AS(

    SELECT date_trunc('second', starttime) AS starttime, count(*) AS ct
    FROM data_store  
    WHERE id_user_table=1 --other restrictions
    GROUP  BY 1

)

SELECT gp.tp AS starttime-from, 
       gp.tp + interval '4 second' AS starttime-to, 
       COALESCE(sum(ds.ct),0) AS ct
FROM  generate_period gp
LEFT JOIN data_series ds ON date_trunc('second',ds.starttime) >= gp.tp 
                        and date_trunc('second',ds.starttime) < gp.tp + interval '4 second'
GROUP BY 1
ORDER BY 1;