I have a table like this:

ID    BEGIN    END

If there are overlapping episodes for the same ID (like 2000-01-01 - 2001-12-31 and 2000-06-01 - 2002-06-30), I would like the rows to be merged, using MIN(BEGIN), MAX(END).

The same should be done if episodes are in direct succession (like 2000-01-01 - 2000-06-30 and 2000-07-01 - 2000-12-31).

If there are "missing" days between episodes (like 2000-01-01 - 2000-06-15 and 2000-07-01 - 2000-12-31), they should not be merged.

How can this be achieved?

Currently my code looks like this:

SELECT "ID", MIN("BEGIN"), MAX("END")
FROM ...
GROUP BY "ID"

but of course, this doesn't fulfill the last condition (not to merge if there are "missing" days).
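
On the sample data given in [edit 3] below, for example, this query returns

ID        BEGIN         END
1;"2000-01-01";"2000-10-31"
2;"2000-01-15";"2000-04-15"
3;"2000-06-01";"2000-07-15"

which wrongly bridges the gaps within ID 1 and ID 3.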

Thank you in advance!

[edit]

I am working on a solution where I join the table with itself. It's an improvement, but it doesn't do the job yet (it still returns rows that should be excluded). I think the other suggestions are better (but more complicated). However, I'd like to share my unfinished work in progress:

SELECT tab1."ID", LEAST(tab1."BEGIN", tab2."BEGIN"), GREATEST(tab1."END", tab2."END")
  FROM <mytable> AS tab1
  JOIN <mytable> AS tab2
    ON tab1."ID" = tab2."ID"
    AND (tab1."BEGIN", tab1."END" + INTERVAL '2 day') OVERLAPS (tab2."BEGIN", tab2."END")
  ORDER BY tab1."ID"

[edit 2]

Thank you for your help!

I have tried for some hours to figure out how window functions and WITH queries work - until I realised that my database runs on PostgreSQL 8.3 (which supports neither of them). Is there a way to go without window functions and WITH queries?

Thank you once more!

[edit 3]

Sample data:

ID        BEGIN         END
1;"2000-01-01";"2000-03-31"  
1;"2000-04-01";"2000-05-31"  
1;"2000-04-15";"2000-07-31"  
1;"2000-09-01";"2000-10-31"  
2;"2000-02-01";"2000-03-15"  
2;"2000-01-15";"2000-03-31"  
2;"2000-04-01";"2000-04-15"  
3;"2000-06-01";"2000-06-15"  
3;"2000-07-01";"2000-07-15"  

Sample output:

ID        BEGIN         END
1;"2000-01-01";"2000-07-31"
1;"2000-09-01";"2000-10-31"
2;"2000-01-15";"2000-04-15"
3;"2000-06-01";"2000-06-15"
3;"2000-07-01";"2000-07-15"

[edit 4]

One possible solution (the key idea: t1 collects every begin/end boundary point per ID, and in t3 the difference num - ROW_NUMBER() stays constant within each run of connected intervals, which gives the grouping key num1):

WITH
  t1 AS (
    SELECT id, "begin" AS time
      FROM "nace-8510-test".checkfkt
    UNION ALL
    SELECT id, "end"
      FROM "nace-8510-test".checkfkt
  ),

  t2 AS (
    SELECT Row_Number() OVER(PARTITION BY id ORDER BY time) AS num, id, time
      FROM t1
  ),

  t3 AS (
    SELECT t2_1.num - Row_Number() OVER(PARTITION BY t2_1.id ORDER BY t2_1.time, t2_2.time) AS num1,
        t2_1.id, t2_1.time AS "begin", t2_2.time AS "end"
      FROM t2 AS t2_1
        INNER JOIN t2 AS t2_2
          ON t2_1.id = t2_2.id
            AND t2_1.num = t2_2.num - 1
      WHERE
        EXISTS (
          SELECT *
            FROM "nace-8510-test".checkfkt AS s
            WHERE s.id = t2_1.id
              AND (s."begin" < t2_2.time AND s."end" > t2_1.time)
        )
        OR t2_1.time = t2_2.time
        OR t2_1.time + INTERVAL '1 day' = t2_2.time
  )

SELECT id, MIN("begin") AS von, MAX("end") AS bis  -- von/bis = from/to
  FROM t3
  GROUP BY id, num1
  ORDER BY id

With many thanks to the author of this article: http://blog.developpez.com/sqlpro/p9821/langage-sql-norme/agregation-d-intervalles-en-sql-1/

speendo
  • "Is there a way to go without window-functions and WITH-queries?" -- probably not, because you'll need to recursively merge rows at one point or another. – Denis de Bernardy May 30 '11 at 16:16
  • I see - thank you. Is my attempt therefore not worth pursuing? I mean, it gives me all the rows I need; my only difficulty is excluding unneeded rows. However, I've asked my admin to update the PostgreSQL server. Hopefully he will be so kind. – speendo May 30 '11 at 16:37
  • Without recursive queries, you can find the boundaries using huge (and very slow) joins on curr.start <= prev.end and the like. But I'm not aware of any means to merge the rows together without recursion. – Denis de Bernardy May 30 '11 at 20:08
  • See this related thread: http://stackoverflow.com/questions/6018445/get-list-with-start-and-end-values-from-table-of-datetimes – Denis de Bernardy May 30 '11 at 20:09
  • That's what I suspected - speed is the drawback. As I have a really huge database, recursion is probably the only reasonable way to go. Hope my admin can help me... btw, thanks for the link. – speendo May 30 '11 at 22:20
  • See also: http://stackoverflow.com/questions/6186854/how-to-select-lines-in-sql-while-a-condition-lasts – Denis de Bernardy May 31 '11 at 11:51

4 Answers

Edit: That is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The windowing functions alone make the upgrade a worthwhile investment.

My original answer, as you note, has a major flaw: it is limited to one row per id.
Below is a better solution without such a limitation.
I have tested it using test tables on my system (8.4).

If / when you get a moment I would like to know how it performs on your data.
I also wrote up an explanation here: https://www.mechanical-meat.com/1/detail

WITH RECURSIVE t1_rec ( id, "begin", "end", n ) AS (
    SELECT id, "begin", "end", n
      FROM (
        SELECT
            id, "begin", "end",
            CASE 
                WHEN LEAD("begin") OVER (
                PARTITION BY    id
                ORDER BY        "begin") <= ("end" + interval '2' day) 
                THEN 1 ELSE 0 END AS cl,
            ROW_NUMBER() OVER (
                PARTITION BY    id
                ORDER BY        "begin") AS n
        FROM mytable 
    ) s
    WHERE s.cl = 1
  UNION ALL
    SELECT p1.id, p1."begin", p1."end", a.n
      FROM t1_rec a 
           JOIN mytable p1 ON p1.id = a.id
       AND p1."begin" > a."begin"
       AND (a."begin",  a."end" + interval '2' day) OVERLAPS 
           (p1."begin", p1."end")
)
SELECT t1.id, min(t1."begin"), max(t1."end")
  FROM t1_rec t1
       LEFT JOIN t1_rec t2 ON t1.id = t2.id 
       AND t2."end" = t1."end"
       AND t2.n < t1.n
 WHERE t2.n IS NULL
 GROUP BY t1.id, t1.n
 ORDER BY t1.id, t1.n;
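
A note on the final query: the recursive member can grow the same merged range from more than one seed row, so the LEFT JOIN with WHERE t2.n IS NULL works as an anti-join, keeping a chain only if no earlier chain (smaller n) already reaches the same "end" date; GROUP BY id, n then condenses each surviving chain to its MIN("begin") and MAX("end").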

Original (deprecated) answer follows;
note: limitation of one row per id.


Denis is probably right about using lead() and lag(), but there is yet another way!
You can also solve this problem using so-called recursive SQL.
The overlaps function also comes in handy.
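
OVERLAPS treats its arguments as half-open periods, so two ranges that merely share an endpoint do not overlap; that is why the queries pad one side with an interval to also catch directly succeeding episodes. A quick, self-contained illustration (plain PostgreSQL, hypothetical date literals):

SELECT (DATE '2000-01-01', DATE '2000-06-30') OVERLAPS
       (DATE '2000-06-30', DATE '2000-12-31');   -- false: the endpoint is only shared

SELECT (DATE '2000-01-01', DATE '2000-06-30' + INTERVAL '2' DAY) OVERLAPS
       (DATE '2000-07-01', DATE '2000-12-31');   -- true: the padding bridges the succession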

I have fully tested this solution on my system (8.4).
It works well.

WITH RECURSIVE rec_stmt ( id, "begin", "end" ) AS (
    /* seed statement:
           start with only the first start and end dates for each id
    */
      SELECT id, MIN("begin"), MIN("end")
        FROM mytable
    GROUP BY id

    UNION ALL

    /* iterative (not really recursive) statement:
           append qualifying rows to the resultset
    */
      SELECT t1.id, t1."begin", t1."end"
        FROM rec_stmt r
             JOIN mytable t1 ON t1.id = r.id
         AND t1."begin" > r."end"
         AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS
             (t1."begin" - INTERVAL '1' DAY, t1."end")
)
  SELECT id, MIN("begin"), MAX("end")
    FROM rec_stmt
GROUP BY id;
mechanical_meat
  • Nice. I would only use `(t1.begin, t1.end + INTERVAL '1' DAY)`. I think a few edge cases may show wrong without this change. – ypercubeᵀᴹ May 30 '11 at 12:56
  • Thank you for the comment and suggestion. I think that the way `OVERLAPS` works is mildly counterintuitive in that the two ranges must actually overlap as opposed to sharing an endpoint (unless one or both of the two ranges happens to have the endpoint as both start and end date). – mechanical_meat May 30 '11 at 13:02
  • @Adam: I didn't know that. So maybe a `+ INTERVAL '2' DAY` may be suitable to catch the case `2000-01-01 - 2001-12-31` and `2002-01-01 - 2002-06-30`. – ypercubeᵀᴹ May 30 '11 at 13:13
  • Isn't this going to return only one range for every ID? – Andriy M May 30 '11 at 14:23
  • One more thing, which I didn't notice at first. These two conditions should definitely not be applied together: `t1.id = r.id AND t1.id > r.id`. – Andriy M May 30 '11 at 15:18
  • Thank you for the comment and observation. That extra condition was appropriate only to the table on which I tested this query. I have now changed it to be appropriate to Marcel's table. – mechanical_meat May 30 '11 at 17:36
  • My admin was so kind as to update to 8.4 - now I was able to test this solution. Am I right that with this solution I would always get only one occurrence of an ID? This would be a serious drawback, for instance if you have data like this: `(ID, begin, end)`, `(1, 2000-01-01, 2000-03-31)`, `(1, 2000-04-01, 2000-05-31)`, `(1, 2000-07-01, 2000-08-31)` - the result should be `(1, 2000-01-01, 2000-05-31)`, `(1, 2000-07-01, 2000-08-31)` – speendo Jun 15 '11 at 10:08
  • @Marcel: I responded as soon as I could. Please see my edited answer, which does not have such a limitation. Congratulations on getting your DBA to agree to an upgrade; that's great news! – mechanical_meat Jun 15 '11 at 18:29
  • Thank you for caring so much! I tried your solution - there seems to be one problem left (I come to that later). However, to talk about performance: my problem is that this query simply doesn't finish in a reasonable amount of time (neither do the other solutions I have found so far). This is probably because a solution in SQL seems impossible without joining the table (or parts of it) with itself at least 3 times. I will compare this with a solution in a procedural language. – speendo Jun 20 '11 at 08:38
  • Coming to the problem in your solution: maybe this is what you intended, but rows that don't overlap seem to be left out (is that what you mean by "non-overlapping date-ranges should not be condensed")? (Please note the example data I will add to my posting in a few minutes.) If you could change that, I'd like to mark your answer as accepted (maybe you could also link to this article: blog.developpez.com/sqlpro/p9821/langage-sql-norme/…). I'll try a solution in R now, as I hope R would be faster than SQL in this specific case. I'll report about my findings. – speendo Jun 20 '11 at 09:05
  • You did a great job, therefore I will accept your answer now. However, it turned out that this problem can be solved MUCH faster (by a factor of 1,000 or more) using a procedural language (I used R). Of course this is my fault (as the one who asked for an SQL solution), not yours. Thank you for your help! – speendo Jul 13 '11 at 09:18

I can't make full sense of your question, but I'm absolutely certain that you need to look into the lead()/lag() window functions.

Something like this, for instance, will be a good starting point to place in a subquery or a common table expression, in order to identify whether rows overlap or not per id:

select id,
       lag("start") over w as prev_start,
       lag("end") over w as prev_end,
       "start",
       "end",
       lead("start") over w as next_start,
       lead("end") over w as next_end
from yourtable
window w as (
       partition by id
       order by "start", "end"
       )
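
Building on that starting point, here is one possible sketch (not part of the original answer; it assumes the same hypothetical yourtable(id, "start", "end") and PostgreSQL 8.4 or later): compute the running maximum "end" per id, so that ranges contained in an earlier, longer range do not split a group; flag rows that start more than one day after that maximum; and turn the flags into group numbers with a running sum:

select id, min("start") as "begin", max("end") as "end"
  from (
    select id, "start", "end",
           -- running sum of flags yields a group number per island
           sum(new_grp) over (partition by id order by "start", "end") as grp
      from (
        select id, "start", "end",
               -- flag rows that begin more than one day after the latest
               -- "end" seen so far (NULL on the first row falls to else 0)
               case when "start" > lag(run_end) over (partition by id order by "start", "end")
                                   + interval '1 day'
                    then 1 else 0 end as new_grp
          from (
            select id, "start", "end",
                   -- running maximum of "end" up to and including the current row
                   max("end") over (partition by id order by "start", "end") as run_end
              from yourtable
          ) s
      ) flagged
  ) grouped
 group by id, grp
 order by id, min("start");

For the table in the question, the column names would be "ID", "BEGIN" and "END" instead.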
Denis de Bernardy

Regarding your second concern, I'm not sure about PostgreSQL, but SQL Server has DATEDIFF(interval, start_date, end_date), which gives you the specified interval between two dates. You could use MIN(Begin) as the start date and MAX(End) as the end date to get the difference. You could then use this in a CASE statement to output something, although you might need a sub-query or something equivalent for your scenario.
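
In PostgreSQL no DATEDIFF is needed: subtracting one date from another directly yields the number of days in between. A sketch of the idea described in the comments below (mytable is a placeholder; note that overlapping rows inflate covered_days, so this detects gaps reliably only when the rows of an ID do not overlap):

SELECT "ID",
       MAX("END") - MIN("BEGIN") + 1 AS span_days,
       SUM("END" - "BEGIN" + 1)      AS covered_days,
       CASE WHEN MAX("END") - MIN("BEGIN") + 1 > SUM("END" - "BEGIN" + 1)
            THEN 'has missing days'
            ELSE 'contiguous' END    AS verdict
  FROM mytable
 GROUP BY "ID";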

Mentor
  • hm, I think there's a problem with this solution: with min(begin) you get the first entry of "begin", with max(end) you get the last entry of "end", but you lose the information on all "begins" and "ends" in between. – speendo May 30 '11 at 12:59
  • You could calculate the difference between min and max, and then for each row sum their respective min and max; if that number is lower than the first, it means there are free days in the schedule. That's what I was thinking in the first place :) – Mentor May 30 '11 at 15:23
  • It would work in every case I think, depending on how you implement it :) – Mentor May 31 '11 at 09:38

Pure SQL

For a pure SQL solution, look at Adam's post and read this article: http://blog.developpez.com/sqlpro/p9821/langage-sql-norme/agregation-d-intervalles-en-sql-1/ (it is written in French, but you will find it's not too hard to read). The article was recommended to me after consulting the postgresql mailing list (thank you for that!).

For my data this was not suitable, because all possible solutions need to self-join the table at least 3 times. This turns out to be a problem for (very) large amounts of data.

Semi-SQL, Semi-Imperative Language

If you primarily care about speed and you have the possibility to use an imperative language, you can get much faster (depending on the amount of data, of course). In my case the task performed (at least) 1,000 times faster, using R.

Steps:

(1) Get a .csv file. Take care of the sorting!

COPY (
  SELECT "ID", "BEGIN", "END"
  <sorry, for a reason I don't know StackOverflow won't let me finish my code here...>

(2) Do something like this (this code is R, but you could do something similar in any imperative language):

data <- read.csv2("</path/to.csv>")
data$BEGIN <- as.Date(data$BEGIN)
data$END <- as.Date(data$END)

smoothingEpisodes <- function (theData) {

    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    ID <- as.integer(theData[["ID"]])
    BEGIN <- as.numeric(theData[["BEGIN"]])
    END <- as.numeric(theData[["END"]])

    curId <- ID[[1L]]
    curBEGIN <- BEGIN[[1L]]
    curEND <- END[[1L]]

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- ID[[i]]
        nextBEGIN <- BEGIN[[i]]
        nextEND <- END[[i]]

        # start a new episode when the ID changes or there is a gap of at
        # least one full day; otherwise extend the current episode
        if (curId != nextId | (curEND + 1) < nextBEGIN) {
            out.1[[j]] <- curId
            out.2[[j]] <- curBEGIN
            out.3[[j]] <- curEND

            j <- j + 1L

            curId <- nextId
            curBEGIN <- nextBEGIN
            curEND <- nextEND
        } else {
            curEND <- max(curEND, nextEND, na.rm = TRUE)
        }
    }

    out.1[[j]] <- curId
    out.2[[j]] <- curBEGIN
    out.3[[j]] <- curEND

    theOutput <- data.frame(ID = out.1[1:j], BEGIN = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

data1 <- smoothingEpisodes(data)

data2 <- transform(data1, TAGE = (as.numeric(data1$END - data1$BEGIN) + 1))

write.csv2(data2, file = "</path/to/output.csv>")

You can find a detailed discussion on this R-Code here: "smoothing" time data - can it be done more efficient?

speendo