1

Question: How do I JOIN multiple (3+) tables, which all have SCD type 2 validFrom/validTo dates in them?

I have the following tables:

-- table 1
CREATE TABLE dbo.Clients (
    clientCode    varchar(10) NOT NULL,
    startDate     date NOT NULL,
    [name]        varchar(200) NOT NULL,
    CONSTRAINT PK_Clients PRIMARY KEY CLUSTERED (clientCode, startDate)
);

-- table 2
CREATE TABLE dbo.Projects (
    clientCode    varchar(10) NOT NULL,  --- Each project belongs to a client.
    projectCode   varchar(10) NOT NULL,
    startDate     date NOT NULL,
    [name]        varchar(200) NOT NULL,
    CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (projectCode, startDate)
);

.. with the following dummy data:

-- dummy data
INSERT INTO dbo.Clients (clientCode, startDate, [name])
VALUES ('A', {d '2010-01-01'}, 'Client A (first)'),
       ('A', {d '2011-04-01'}, 'Client A (second)'),
       ('A', {d '2011-09-01'}, 'Client A (third)'),
       ('A', {d '2012-02-01'}, 'Client A (fourth)'),
       ('A', {d '2014-01-01'}, 'Client A (fifth)'),
       ('B', {d '2010-01-01'}, 'Client B (first)'),
       ('B', {d '2011-02-01'}, 'Client B (second)'),
       ('B', {d '2011-08-01'}, 'Client B (third)'),
       ('B', {d '2011-12-01'}, 'Client B (fourth)'),
       ('B', {d '2012-11-01'}, 'Client B (fifth)');

-- dummy data
INSERT INTO dbo.Projects (clientCode, projectCode, startDate, [name])
VALUES ('A', '1', {d '2010-01-15'}, 'Project 1, first revision'),
       ('A', '1', {d '2012-04-22'}, 'Project 1, second revision'),
       ('A', '2', {d '2010-02-08'}, 'Project 2, first revision'),
       ('A', '2', {d '2010-09-12'}, 'Project 2, second revision'),
       ('A', '2', {d '2012-08-18'}, 'Project 2, third revision'),
       ('B', '3', {d '2011-04-01'}, 'Project 3, first revision'),
       ('B', '3', {d '2011-12-01'}, 'Project 3, second revision'),
       ('B', '3', {d '2014-02-28'}, 'Project 3, third revision');

Using these two tables, we generate startDate and endDate intervals:

--- Clients:
WITH c (clientCode, [name], startDate, endDate) AS (
    SELECT clientCode, [name], startDate,
           --- Find the next record's startDate, ordered by startDate.
           LEAD(startDate, 1, {d '2099-12-31'}) OVER (
               PARTITION BY clientCode
               ORDER BY startDate) AS endDate
    FROM dbo.Clients),

--- Projects:
     p (projectCode, clientCode, [name], startDate, endDate) AS (
    SELECT projectCode, clientCode, [name], startDate,
           --- Find the next record's startDate, order by startDate
           LEAD(startDate, 1, {d '2099-12-31'}) OVER (
               PARTITION BY projectCode
               ORDER BY startDate) AS endDate
    FROM dbo.Projects)

SELECT c.clientCode, c.[name] AS clientName,
       p.projectCode, p.[name] AS projectName,
       --- Start date is the last of (c.startDate, p.startDate)
       (CASE WHEN c.startDate<p.startDate THEN p.startDate ELSE c.startDate END) AS startDate,
       --- End date is the first of (c.endDate, p.endDate)
       (CASE WHEN c.endDate<p.endDate THEN c.endDate ELSE p.endDate END) AS endDate
FROM c
LEFT JOIN p ON
    c.clientCode=p.clientCode AND
    c.startDate<p.endDate AND
    c.endDate>p.startDate

-- IF two new tables were introducted (t3 and t4), would the following JOINS work?
-- LEFT JOIN dbo.Table3 as t3
-- on p.clientCode = t3.clientcode AND
-- p.startdate<t3.endate AND
-- p.endDate>t3.startdate
-- LEFT JOIN dbo.Table4 as t4
-- on t3.toolId = t4.toolid AND      --> toolId is a new key that I need for the join, since t4 does not have clientCode
-- t3.startdate<t4.enddate AND
-- t3.enddate>t4.startdate
ORDER BY c.clientCode, p.projectCode, 5;

My problem: In the bottom of the above query, I commented out the LEFT JOINS, which I will have to make when more SCD2 tables are introduced. I am unsure if the commented out LEFT JOINS i made will work. Do you see any issues with it?

Adding more JOINS maybe conflicts the CASE WHEN statement used in the above query..:

       --- Start date is the last of (c.startDate, p.startDate)
       (CASE WHEN c.startDate<p.startDate THEN p.startDate ELSE c.startDate END) AS startDate,
       --- End date is the first of (c.endDate, p.endDate)
       (CASE WHEN c.endDate<p.endDate THEN c.endDate ELSE p.endDate END) AS endDate

This CASE when statement is used because I want no two intervals to reference the same date. So, the output interval is defined by the larger of (a.startTime, b.startTime) and the smaller of (a.endTime, b.endTime).

I see an issue here, since this CASE WHEN statement only evaluates startDate and endDate intervals from 2 tables and not 3, 4 or more tables.

How can this perhaps be solved?

Dale K
  • 25,246
  • 15
  • 42
  • 71
matt_vice89
  • 65
  • 2
  • 6
  • 1
    Please don't put spaces between all your code... it makes the question twice are long and much harder to read. – Dale K May 19 '22 at 23:45
  • 2
    FYI `case` is an *expression* not a *statement*. – Dale K May 19 '22 at 23:45
  • I think you want to try and create a [mre] - with emphasis on minimal - as it stands its quite complicated which is going to put people off. Either way you need to flesh out your example to include data from the new tables AND to show your desired results. – Dale K May 19 '22 at 23:54
  • 2
    Did you need a LEFT JOIN? The CASE WHEN calculating the result's start/enddates doesn't seem to take the NULLs from p into account. As for joining more tables: I would treat this as the Step1 (subquery), and join the next table, calculate its start/end, and then add another layer. This is becuase you need to find StartDate as LATEST of each of the StartDates, and the EndDate as the EARLIEST of EndDates (there are no such functions in SqlServer, and implementing them in a CASE WHEN with 4 values, which can also be NULLs can be very complex). – tinazmu May 20 '22 at 00:34

1 Answers1

0

Would you be interested in using SqlServer's geometry data type to represent time periods? Here I applied it to your example:

WITH c (clientCode, [name], Perd) AS (
    SELECT clientCode, [name],
           Perd=geometry::STGeomFromText('LINESTRING (' + format(startdate,'yyyyMMdd')+' 0, '+
                      format(LEAD(startDate, 1, {d '2099-12-31'}) OVER (
                               PARTITION BY clientCode
                               ORDER BY startDate) , 'yyyyMMdd') +' 0)', 0)
    FROM #Clients),

--- Projects:
     p (projectCode, clientCode, [name], Perd) AS (
    SELECT projectCode, clientCode, [name], 
           Perd=geometry::STGeomFromText('LINESTRING (' + format(startdate,'yyyyMMdd')+' 0, '+
                       format(LEAD(startDate, 1, {d '2099-12-31'}) OVER (
                                PARTITION BY projectCode
                                ORDER BY startDate) , 'yyyyMMdd') +' 0)', 0)
    FROM #Projects)
SELECT c.clientCode, c.[name] AS clientName,
       p.projectCode, p.[name] AS projectName,
       startDate=try_cast(format(c.Perd.STIntersection(p.Perd).STEndPoint().STX ,'########') as date),
       endDate=try_cast(format(c.Perd.STIntersection(p.Perd).STStartPoint().STX, '########') as date)
FROM 
    c
    inner join
    p on
    c.clientCode=p.clientCode AND p.Perd.STIntersection(c.Perd).STLength()>0
order by 1,5

This can be easier to nest as a subquery, and join to another temporal table.

I would imagine that this wouldn't be very fast with very large data-sets, though.

tinazmu
  • 3,880
  • 2
  • 7
  • 20