3

I have two tables both of which have columns StartDate and EndDate.

I'm trying to return a single resultset that contains all date ranges from one table (TableA), and all complement date ranges from the other one (TableB).

CREATE TABLE [dbo].[TableA](
    [ID] [int] NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [EndDate] [datetime] NOT NULL
)

CREATE TABLE [dbo].[TableB](
    [ID] [int] NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [EndDate] [datetime] NOT NULL
)

INSERT INTO TableA (ID, StartDate, EndDate) VALUES(1, '4/1/2009', '8/1/2009')
INSERT INTO TableA (ID, StartDate, EndDate) VALUES(1, '10/1/2009', '12/1/2009')
INSERT INTO TableB (ID, StartDate, EndDate) VALUES(1, '1/1/2009', '2/1/2010')

INSERT INTO TableA (ID, StartDate, EndDate) VALUES(2, '4/1/2009', '8/1/2009')
INSERT INTO TableB (ID, StartDate, EndDate) VALUES(2, '1/1/2009', '5/1/2009')
INSERT INTO TableB (ID, StartDate, EndDate) VALUES(2, '7/1/2009', '12/1/2009')

The expected resultset from the three datasets should be:

(ID = 1)
1/1/2009 - 4/1/2009 (from TableB)
4/1/2009 - 8/1/2009 (from TableA)
8/1/2009 - 10/1/2009 (from TableB)
10/1/2009 - 12/1/2009 (from TableA)
12/1/2009 - 2/1/2010 (from TableB)

(ID = 2)
1/1/2009 - 4/1/2009 (from TableB)
4/1/2009 - 8/1/2009 (from TableA)
8/1/2009 - 12/1/2009 (from TableB)

The date ranges are not guaranteed to be continuous, and I can't make any assumptions on how they're overlapping between tables...within each table they can be assumed to not overlap.

I'm having problems wrapping my head around how to split the single date ranges in TableB into multiple pieces to find all the complement "regions" within it in SQL.

Anyone have any suggestions?

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
Tero Paananen
  • 103
  • 1
  • 4
  • 1
    define 'complement' in this context. – Remus Rusanu Dec 28 '09 at 21:04
  • How many rows do you have? Is performance an issue? – Mark Byers Dec 28 '09 at 21:10
  • I think he means he wants all the rows from A, plus all the parts of periods in B which don't overlap with any period in A. In other words: `A union (B minus (B intersect A))`, so that A and B' are disjoint and `A union B' == A union B`. – Mark Byers Dec 28 '09 at 21:53
  • Mark got it right. Upto 100,000 rows on TableB...less than 500 on TableA, btw. – Tero Paananen Dec 28 '09 at 22:03
  • I guess since you've already accepted the answer I guess it means that performance is good enough for you, but I'd be quite interested if you can let me know roughly how fast the query is on your real data. – Mark Byers Dec 28 '09 at 22:15
  • 1
    PS +1 on the question for including test data with create scripts. I probably wouldn't have bothered with this question if you hadn't made the create scripts as it would be too boring to make them myself (even though it wouldn't have taken all that long). I wish more people would take the time to do this to save everyone else the effort. – Mark Byers Dec 28 '09 at 22:22
  • 1
    Ok, adjusted the solution to match the real-world situation and copied the production data onto my dev environment. 77K records in TableB, 5K records in TableA (will be LOT less once I'm done with the project). Total runtime on the view: 17 seconds resulting in 96K rows. That's not bad, but can definitely be improved with better indexes in the tables in question. It runs in less than a second when there're 100 records in TableA. – Tero Paananen Dec 28 '09 at 22:49
  • 1
    Runtime reduced to 1 second with proper indexing. Very nice! – Tero Paananen Dec 28 '09 at 22:53
  • I'm glad to hear that it works well enough for you. Thanks for posting back with performance details from your real data. – Mark Byers Dec 29 '09 at 00:39

1 Answers1

1

If you create this as a view, I think it does what you want. It uses CTEs, which should be supported by SQL Server 2005, but not earlier.

WITH Timestamps AS (
    SELECT Id, StartDate AS Date FROM TableA
    UNION
    SELECT Id, EndDate AS Date FROM TableA
    UNION
    SELECT Id, StartDate AS Date FROM TableB
    UNION
    SELECT Id, EndDate AS Date FROM TableB
), Timestamps2 AS (
    SELECT ROW_NUMBER() OVER (ORDER BY Id, Date) AS RowNumber, * FROM Timestamps
), Timestamps3 AS (
    SELECT T1.ID, T1.Date AS StartDate, T2.Date AS EndDate
    FROM Timestamps2 AS T1 JOIN Timestamps2 AS T2
    ON T1.RowNumber + 1 = T2.RowNumber AND T1.ID = T2.ID
), IntervalsFromB AS (
    SELECT T.ID, T.StartDate, T.EndDate FROM Timestamps3 AS T
    LEFT JOIN TableA AS A
    ON T.StartDate >= A.StartDate AND T.EndDate <= A.EndDate
    WHERE A.StartDate IS NULL)
SELECT * FROM TableA
UNION ALL
SELECT * FROM IntervalsFromB

Full output (ordered by Id, StartDate for readability):

Id  StartDate               EndDate
1   2009-01-01 00:00:00.000 2009-04-01 00:00:00.000
1   2009-04-01 00:00:00.000 2009-08-01 00:00:00.000
1   2009-08-01 00:00:00.000 2009-10-01 00:00:00.000
1   2009-10-01 00:00:00.000 2009-12-01 00:00:00.000
1   2009-12-01 00:00:00.000 2010-02-01 00:00:00.000
2   2009-01-01 00:00:00.000 2009-04-01 00:00:00.000
2   2009-04-01 00:00:00.000 2009-08-01 00:00:00.000
2   2009-08-01 00:00:00.000 2009-12-01 00:00:00.000

It was pretty complicated for me to implement this, so I'm wondering if anyone can see a simpler way. I might be missing some trick that makes this much simpler. If so, please let me know! Also, you will almost certainly need some indexes on your tables to get this to perform well if you have a lot of rows. Some other optimizations may be possible - I haven't tried for the fastest possible performance, but just to get the correct result.

Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • 1
    You can replace the final `UNION` with a `FULL JOIN`, otherwise the query is correct. See here: http://explainextended.com/2009/11/09/inverting-date-ranges/ – Quassnoi Dec 28 '09 at 21:53
  • Nice link - it pretty much explains the query I just wrote. I would never have found that by Googling. – Mark Byers Dec 28 '09 at 22:00
  • PS, I think my final UNION ALL is correct - that's just the part where I combine the results from TableA and (TableB-TableA). I think the part you are referring to with the FULL JOIN in my query is inside Timestamps3 (yeah, bad names, I'm sorry) where instead I do an 'INNER JOIN'. This kills the two rows with NULLs but I think that's what he wants anyway, so I don't think any change is required. – Mark Byers Dec 28 '09 at 22:05
  • You could probably also change some of the 'UNIONs' to 'UNION ALLs' in the first CTE to get better performance. I think it's a minor issue so I ignored this. – Mark Byers Dec 28 '09 at 22:07
  • Good stuff. Looks to be working very nicely with the example dataset. I'll give it a go on my real dataset in a sec. Thanks a lot Mark, never would've figured this one on my own. – Tero Paananen Dec 28 '09 at 22:12