I need help with this SQL query. I have a big table with the following schema:

  • time_start (timestamp) - start time of the measurement,
  • duration (double) - duration of the measurement in seconds,
  • count_event1 (int) - number of measured events of type 1,
  • count_event2 (int) - number of measured events of type 2

I am guaranteed that no rows will overlap - in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.

I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:

SELECT
    ROUND(time_start/group_period,0) AS time_period,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2 
FROM measurements
GROUP BY time_period;

However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?

Performance is a concern for me because, in time, I expect the table to grow considerably, reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?

Petr Mánek
  • I'm afraid `GROUP BY` alone will not give you what you want, as any row can be included in a group only once before aggregation happens. You'll have to perform a subquery first and join your table against something so that long durations will give you multiple rows - then you can group by. I can't think of anything to join against that would be even remotely efficient, though. A recursive query that generates a sequence comes to mind, but I'm not skilled enough in PostgreSQL to write something like that, and the performance is questionable. – Timekiller Dec 28 '15 at 18:32
  • If you're not limited to a single query, you can create another table, then prepopulate it with timestamps that differ by `group_period`, then join on the condition that `prepopulated_timestamp BETWEEN time_start AND (time_start + duration)`. – Timekiller Dec 28 '15 at 18:35
  • @Timekiller prepopulating seems to get the task done since I'm not limited by the number of queries (I'm performing everything as a transaction) - I only worry about the performance side of this approach: will it be acceptable if the base table has hundreds of millions of rows? – Petr Mánek Dec 28 '15 at 18:45
  • @Timekiller I have followed your advice, check out my answer http://stackoverflow.com/a/34501634/1115613 – Petr Mánek Dec 28 '15 at 22:21

2 Answers

Based on Timekiller's advice, I have come up with the following query:

-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.

-- First some configuration:
--   group_period = 3600   -- group by 1 hour (= 3600 seconds)
--   min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
--   max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT

-- Calculate the number of started periods in the given interval in advance.
--   period_count = CEIL((max_time - min_time) / group_period)

SET TIME ZONE UTC;
BEGIN TRANSACTION;

-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
    ON COMMIT DROP;
INSERT INTO periods (period_start)
    SELECT to_timestamp(min_time + group_period * coefficient)
    FROM generate_series(0, period_count) as coefficient;

-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
--   A. [period_start, period_start + group_period]
--   B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
    period_start,
    COUNT(measurements.*) AS count_measurements,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;

COMMIT TRANSACTION;

It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:

  • I expect the measurements table to have about 500-800 million rows.
  • The time_start column is the primary key and has a unique btree index on it.
  • I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.
Petr Mánek
  • Now that I'm fully awake and think of it, `time_start BETWEEN period_start AND (period_start + group_period)` is not sufficient, since it won't give you multiple rows for long-running tasks. You indeed need to overlap the intervals; fortunately it's simpler than it looks: `time_start <= (period_start + group_period) AND period_start <= (time_start + duration)`, see http://stackoverflow.com/a/117977/1375470 – Timekiller Dec 29 '15 at 06:13
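
For reference, a minimal sketch of the answer's join rewritten with that overlap condition (untested; `group_period` is written out as a 1-hour interval literal, and `duration` is converted to an interval since it is stored as seconds):

SELECT
    period_start,
    COUNT(measurements.*) AS count_measurements,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
    -- the measurement starts before the period ends...
    ON time_start <= period_start + INTERVAL '1 hour'
    -- ...and ends after the period starts, so long measurements hit every period they span
    AND period_start <= time_start + duration * INTERVAL '1 second'
GROUP BY period_start;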

(This turned out way too large for a comment, so I'll post it as an answer instead).

Adding to my comment on your answer: you should probably go with getting the best results first and optimizing later if it turns out to be slow.

As for performance, one thing I've learned while working with databases is that you can't really predict it. Query optimizers in advanced DBMSs are complex and tend to behave differently on small and large data sets. You'll have to fill your table with some large sample data, experiment with indexes, and read the results of EXPLAIN; there's no other way.
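
For example, a quick (hypothetical) sanity check is to run the grouping query from the answer above under EXPLAIN ANALYZE - assuming the same tables and a 1-hour group period - and see which indexes the planner actually picks:

EXPLAIN (ANALYZE, BUFFERS)
SELECT period_start, SUM(count_event1) AS sum_event1, SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
    ON time_start BETWEEN period_start AND period_start + INTERVAL '1 hour'
GROUP BY period_start;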

There are a few things to suggest, though I know the Oracle optimizer much better than the Postgres one, so some of them might not work.

  • Things will be faster if all the fields you're checking against are included in the index. Since you're performing a left join and periods is the base (left) table, there's probably no reason to index it - it will be included fully either way. duration should be included in the index, though, if you're going to go with proper interval overlap: this way Postgres won't have to fetch the row to calculate the join condition; the index will suffice. Chances are it won't even fetch the table rows at all, since it needs no data other than what exists in the indexes. I think it will perform better included as the second field of the time_start index - at least in Oracle it would - but IIRC Postgres is able to join indexes together, so perhaps a second index would perform better; you'll have to check it with EXPLAIN.

  • Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used for (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create a function-based index (an expression index in Postgres terms - that is, index time_start + duration directly), or alter the structure of the table a bit so that time_start + duration is a separate column, and index that column instead.

  • If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way so that it contains the period start and end explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.

  • Perhaps most important: if you have max_time and min_time, USE THEM to limit the set of measurements before joining! The smaller your sets, the faster it will work. A rough sketch combining these suggestions follows below.
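
One possible version of that sketch (assumed PostgreSQL syntax; the time_end column, the index name and the literal interval/epoch values are made up for illustration, and whether the planner actually benefits has to be verified with EXPLAIN):

-- Materialize time_start + duration so the join does no arithmetic on indexed columns.
ALTER TABLE measurements ADD COLUMN time_end TIMESTAMP;
UPDATE measurements SET time_end = time_start + duration * INTERVAL '1 second';

-- Composite index so the overlap condition can be evaluated from the index alone.
CREATE INDEX idx_measurements_start_end ON measurements (time_start, time_end);

-- Inner join (empty periods are dropped) with measurements pre-filtered to [min_time, max_time].
SELECT
    period_start,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
JOIN measurements
    ON time_start <= period_start + INTERVAL '1 hour'  -- group_period
    AND period_start <= time_end                        -- interval overlap
WHERE time_start BETWEEN to_timestamp(1440226301)      -- min_time
                     AND to_timestamp(1450926301)      -- max_time
GROUP BY period_start;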

Timekiller