5

We have a Postgres table (materialized view) containing around 2 million rows with columns like:

  • start_time (timestampz) - has index
  • end_time (timestampz) - has index

For each row in the table, we would like to add a result column that contains:

  • 1, if the row start and end time range overlaps with any other row
  • 0, if the row start and end time range does not overlap with any other row

What would be an efficient approach to label each row as having overlap (1 or 0)?

EDIT:

The expected output would be something like:

  • row_id
  • has_overlap - boolean or int (1 or 0)
Brylie Christopher Oxley
  • 1,684
  • 1
  • 17
  • 34
  • The closest SO answer we have found was this: https://stackoverflow.com/questions/25733385/postgresql-query-to-detect-overlapping-time-ranges However, it just returns overlapping rows. We specifically need to count the number of overlaps, and use those counts for subsequent reporting/outlier analysis. – Brylie Christopher Oxley Oct 15 '18 at 08:13

1 Answers1

7

I don't think there will be a really fast solution to that, as it does require comparing every row in the table with each and every other row in the table (or at least every other row in the specified range).

Assuming your table's primary key column is named id you could use Postgres' range function to check for overlapping rows:

with check_period (check_range) as (
   values ( tstzrange(timestamptz '2018-10-01 00:00:00', timestamptz '2018-10-14 20:15:00') )
)
select id, 
       start_Time, 
       end_time, 
       exists (select *
        from the_table t2
           cross join check_perioud
        where t2.id <> t1.id 
        and tstzrange(t1.start_time, t1.end_time) && tstzrange(t2.start_time, t2.start_time)
        and tstzrange(t2.start_time, t2.start_time) <@ check_range
       ) has_overlapping_rows
from the_table t1
  cross join check_period
where tstzrange(t1.start_time, t1.end_time) <@ check_range;

The CTE check_period is only there, so that the values for time period you want to analyze are not repeated. If you don't care about repeating them, you can remove it:

select id, 
       start_Time, 
       end_time, 
       exists (select *
        from the_table t2
        where t2.id <> t1.id 
        and tstzrange(t1.start_time, t1.end_time) && tstzrange(t2.start_time, t2.start_time)
        and tstzrange(t2.start_time, t2.start_time) <@ tstzrange(timestamptz '2018-10-01 00:00:00', timestamptz '2018-10-14 20:15:00')
       ) has_overlapping_rows
from the_table t1
where tstzrange(t1.start_time, t1.end_time) <@ tstzrange(timestamptz '2018-10-01 00:00:00', timestamptz '2018-10-14 20:15:00');

You should create an index on the timestamp range to make that quick:

create index on the_table( (tstzrange(start_time, end_time), id );

You can extend the above query to return a count of the overlapping rows rather than a true/false flag:

select id, 
       start_Time, 
       end_time, 
       (select count(*)
        from the_table t2
        where t2.id <> t1.id 
        and tstzrange(t1.start_time, t1.end_time) && tstzrange(t2.start_time, t2.start_time)
        and tstzrange(t2.start_time, t2.start_time) <@ tstzrange(timestamptz '2018-10-01 00:00:00', timestamptz '2018-10-14 20:15:00')
       ) has_overlapping_rows
from the_table t1
where tstzrange(t1.start_time, t1.end_time) <@ tstzrange(timestamptz '2018-10-01 00:00:00', timestamptz '2018-10-14 20:15:00');

However for rows having many overlapping rows, this will be slower, because the count(*) forces the database to inspect all overlapping rows. The exists() solution can stop at the first row found.

  • Also, it is alright if the query you propose is somewhat slow, as we will materialize the results. That way, the data are pre-computed, e.g. on a nightly basis. – Brylie Christopher Oxley Oct 15 '18 at 08:28
  • 1
    Excellent suggestion for the tstzrange index! – Brylie Christopher Oxley Oct 15 '18 at 08:29
  • In your second example, without the CTE, what does the second `and` statement do? Should it be removed as well, as we don't want to hard code any date ranges? – Brylie Christopher Oxley Oct 15 '18 at 08:31
  • @BrylieChristopherOxley: without the CTE you have to repeat the condition on the period you want to analyze, unless you want to compare all rows from the period to check with **all** other rows, i.e. if those rows have any overlapping rows even outside of the period to be checked (btw: the same condition is in the query with the CTE) –  Oct 15 '18 at 08:38