PostgreSQL query to delete records with overlapping times while preserving the earliest?

Question

I'm trying to figure out a way to delete records with overlapping times but I'm unable to figure out a simple and elegant way of keeping all but one of those records which overlap. This question is similar to this one but with a few differences. Our table looks something like:

╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗
║ id │ start_time                            │ end_time                             │ bar    │ baz     ║
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣
║ 0  │ Mon, 18 Dec 2017 16:08:33 UTC +00:00  │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 1  │ Mon, 18 Dec 2017 16:08:32 UTC +00:00  │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 2  │ Mon, 18 Dec 2017 16:08:31 UTC +00:00  │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 3  │ Mon, 18 Dec 2017 16:08:30 UTC +00:00  │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham"  │ "eggs"  ║
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝

In the example above, all records have overlapping times where overlapping just means that the range of time defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's. However, for this problem we are interested not only in those records which have overlapping times but also have matching bar and baz columns (rows 0, 1, and 3 above). After finding those records we'd like to delete all but the earliest, leaving the table above with just records 2 and 3 because record 2 does not have matching bar and baz columns and 3 does and has the earliest start and end times.

Here's what I have so far:

  delete from foos where id in (
    select
      foo_one.id
    from
      foos foo_one
    where
      user_id = 42
      and exists (
        select
          1
        from
          foos foo_two
        where
          tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
            tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
          and
            foo_one.bar = foo_two.bar
          and
            foo_one.baz = foo_two.baz
          and
            user_id = 42
          and
            foo_one.id != foo_two.id
      )
  );

Thanks for reading!

Update: I've found a solution that works for me, basically I could apply the window function row_number() over a partition of the table that are grouped by bar and baz fields and then add a WHERE clause to the DELETE statement that excludes the first entry (the one with the smallest id).

  delete from foos where id in (
    select id from (
      select
          foo_one.id,
          row_number() over(partition by
                              bar,
                              baz
                            order by id asc)
        from
          foos foo_one
        where
          user_id = 42
          and exists (
            select
              *
            from
              foos foo_two
            where
              tsrange(foo_two.start_time::timestamp,
                        foo_two.end_time::timestamp,
                        '[]') &&
                tsrange(foo_one.start_time::timestamp,
                        foo_one.end_time::timestamp,
                        '[]')
              and
                foo_one.id != foo_two.id
          )
    ) foos where row_number <> 1
  );

Please edit your question and add some [sample data](http://plaintexttools.github.io/plain-text-table/) and the expected output based on that data. [Formatted text](http://stackoverflow.com/help/formatting) please, [no screen shots](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557). — klin, Dec 18 '17 at 03:15
Because it is for a RoR project, and I didn't want folks to get tripped up on the ruby-style string interpolation in the query above. — dynsne, Dec 18 '17 at 04:15
I see. However, you've masked everything else with foos and what not. So why not mask the string interpolation, too, and turn it into a pretty-much pure postgreSQL question? — jvillian, Dec 18 '17 at 04:21

Thiago Mata · Answer 1 · 2017-12-18T06:20:09.530

First of all, a small note: you really should give some more information. I understand that you probably don't want to show some real columns of your business, but in the way that it becomes a lot more hard to understand what you want to.

But, I am going to give some tips on that subject. I hope that helps you, and whoever has a similar problem.

You need to be clear what you define as overlaps. That could be a lot of different things to each person.

Look these events:

<--a-->
    <---- b ---->
        <---- c ---->
          <-- d -->
            <---- e ---->
    <------- f -------->
                  <--- g --->

If you define overlaps like the google definition: extend over so as to cover partly, then "b","d","e" and "f" overlaps partly the "c" event. If you define overlaps like the full event of covering, then "c" overlaps "d", and "f" overlaps "b" and "c" and "d".

Deleting groups could be a problem. In that previous case, what we should do? Should we delete "b", "c" and "d" and keep just with "f"? Should we sum their values? Take the average maybe? So, this is a decision to be made, column by column. The meaning of each column is very important. So, I can't help you with "bar" and "baz".

So, trying to guess what you really want to, I am creating a similar table of events with id, begin, end and user_id

create table events (
  id integer,
  user_id integer,
  start_time timestamp,
  end_time timestamp,
  name varchar(100)
);

I am adding the example values

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 1, 1000, timestamp('2017-10-09 01:00:00'),timestamp('2017-10-09 04:00:00'), 'a' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 2, 1000, timestamp('2017-10-09 03:00:00'),timestamp('2017-10-09 15:00:00'), 'b' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 3, 1000, timestamp('2017-10-09 07:00:00'),timestamp('2017-10-09 19:00:00'), 'c' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 4, 1000, timestamp('2017-10-09 09:00:00'),timestamp('2017-10-09 17:00:00'), 'd' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 5, 1000, timestamp('2017-10-09 17:00:00'),timestamp('2017-10-09 23:00:00'), 'e' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 6, 1000, timestamp('2017-10-09 02:30:00'),timestamp('2017-10-09 22:00:00'), 'f' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 7, 1000, timestamp('2017-10-09 17:30:00'),timestamp('2017-10-10 02:00:00'), 'g' );

Now, we can play with some nice queries:

List all the events that are full overlaps with another event:

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    # START AFTER THE EVENT ONE
    event_2.start_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.end_time   <= event_1.end_time
)
  group by 
event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| c          | d              |
| f          | b,d,c          |
+------------+----------------+

To detect the partial overlaps, you will need something like this:

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
  (
    # START AFTER THE EVENT ONE
    event_2.start_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.start_time <= event_1.end_time
   ) or
  (
    # START AFTER THE EVENT ONE
    event_2.end_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.end_time <= event_1.end_time
   )
)
  group by 
event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| a          | b,f            |
| b          | c,d,a          |
| c          | b,d,e,g        |
| d          | b,e            |
| e          | f,g,d,c        |
| f          | a,g,b,d,c,e    |
| g          | c,e,f          |
+------------+----------------+

Of course, I am using a "group by" to make easier to read. That could be useful too if you want to sum or take the average of the overlaps data to update your parent data before the delete. Maybe that "group_concat" function does not exist into Postgres or have a different name. One "standard SQL" that you could test it is:

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  event_2.name as overlaps_name
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    # START AFTER THE EVENT ONE
    event_2.start_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.end_time   <= event_1.end_time
)

Result:

+------------+---------------+
| event_name | overlaps_name |
+------------+---------------+
| f          | b             |
| f          | c             |
| c          | d             |
| f          | d             |
+------------+---------------+

If you are going to try some math operations, keep in mind the risk of adding the value of the "c" and "d" data on "b" and adding their value again on "f", making the value of "f" wrong.

// should be
new f = old f + b + old c + d
new c = old c + b + d // unecessary if you are going to delete it

// very common mistake
new c = old c + b + d // unecessary but not wrong yet
new f = new c + b + d = ( old c + b + d ) + b + d // wrong!!

You can test all these queries and create your own into the same database online using this URL http://sqlfiddle.com/#!9/1d2455/19. But, keep in mind that it is Mysql, not Postgresql. But it is very good to test standard SQL.

There is a thread into StackOverflow about translating group_concat to Postgres https://stackoverflow.com/questions/2560946/postgresql-group-concat-equivalent. Looks simple. — Thiago Mata, Dec 18 '17 at 06:22
Thanks for your response! I didn't end up on this route but it is an interesting approach. — dynsne, Dec 18 '17 at 23:36

PostgreSQL query to delete records with overlapping times while preserving the earliest?

1 Answers1

Linked