5

Suppose I have the following an event table with personId, startDate and endDate.

I want to know how much time the person X spent doing an event (the events can override each other).

If the person just has 1 event, its easy: datediff(dd, startDate, endDate)

If the person has 2 events it gets tricky.

I'll set some scenarios for the expected results.

Scenario 1

startDate endDate
1         4
3         5

This means he the results should be the datediff from 1 to 5

Scenario 2

startDate endDate
1         3
6         9

this means he the results should be the some of datediff(dd,1,3) and datediff(dd,6,9)

How can I get this result on an sql query? I can only think of a bunch of if statements, but the same person can have n events so the query will be really confusing.

Shredder Edit: I'd like to add a 3rd scenario:

startDate endDate
1       5
4       8
11      15

Desired result to Shredder scenario:

(1,5) and (4,8) merge in (1,8) since they overlap then we need to datediff(1,8) + datediff(11,15) => 7 + 4 => 11

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
dcarneiro
  • 7,060
  • 11
  • 51
  • 74

7 Answers7

11

You can use a recursive CTE to build a list of dates and then count the distinct dates.

declare @T table
(
  startDate date,
  endDate date
);

insert into @T values
('2011-01-01', '2011-01-05'),
('2011-01-04', '2011-01-08'),
('2011-01-11', '2011-01-15');

with C as
(
  select startDate,
         endDate
  from @T
  union all
  select dateadd(day, 1, startDate),
         endDate
  from C
  where dateadd(day, 1, startDate) < endDate       
)
select count(distinct startDate) as DayCount
from C
option (MAXRECURSION 0)

Result:

DayCount
-----------
11

Or you can use a numbers table. Here I use master..spt_values:

declare @MinStartDate date
select @MinStartDate = min(startDate)
from @T

select count(distinct N.number)
from @T as T
  inner join master..spt_values as N
    on dateadd(day, N.Number, @MinStartDate) between T.startDate and dateadd(day, -1, T.endDate)
where N.type = 'P'    
Mikael Eriksson
  • 136,425
  • 22
  • 210
  • 281
  • 1
    + I was trying to figure out a way to do this but it escaped me very nice – Conrad Frix Nov 04 '11 at 22:35
  • @Mikael, your solution is pretty awesome. These only one catch that i didn't know it would happen. SQL recursion limit is 100 so it only works for 3 month period. Either way, thumbs up – dcarneiro Nov 07 '11 at 09:32
  • @Daniel, I would recommend you to use the numbers table version. If you need to make the recursive CTE version to handle more than 100 day you should add maxreqursion hint at the end of the query. – Mikael Eriksson Nov 07 '11 at 11:12
2

The following SQL is for the three scenarios you've described

with sampleData 
AS (


    SELECT       1 personid,1 startDate,4 endDate
    UNION SELECT 1,3,5
    UNION SELECT 2,1,3
    UNION SELECT 2,6,9
    UNION SELECT 3,1,5 
    UNION SELECT 3,4,8
    UNION SELECT 3,11, 15

), 
     cte 
     AS (SELECT personid, 
                startdate, 
                enddate, 
                Row_number() OVER(ORDER BY personid, startdate) AS rn 
         FROM   sampledata), 
     overlaps 
     AS (SELECT a.personid, 
                a.startdate, 
                b.enddate, 
                a.rn id1, 
                b.rn id2 
         FROM   cte a 
                INNER JOIN cte b 
                  ON a.personid = b.personid 
                     AND a.enddate > b.startdate 
                     AND a.rn = b.rn - 1), 
     nooverlaps 
     AS (SELECT a.personid, 
                a.startdate, 
                a.enddate 
         FROM   cte a 
                LEFT JOIN overlaps b 
                  ON a.rn = b.id1 
                      OR a.rn = b.id2 
         WHERE  b.id1 IS NULL) 
SELECT personid, 
       SUM(timespent) timespent 
FROM   (SELECT personid, 
               enddate - startdate timespent 
        FROM   nooverlaps 
        UNION 
        SELECT personid, 
               enddate - startdate 
        FROM   overlaps) t 
GROUP  BY personid 

Produces this result

Personid    timeSpent
----------- -----------
1           4
2           5
3           11

Notes: I used the simple integers but the DateDiffs should work too

Correctness issue There is a correctness issue if your data is allowed to have multiple overlaps as Cheran S noted, the results won't be correct and you should use one of the other answers instead. His example used [1,5],[4,8],[7,11] for the same person ID

Conrad Frix
  • 51,984
  • 12
  • 96
  • 155
2

Here's a solution that uses the Tally table idea (which I first heard of in an article by Itzk Ben-Gan -- I still cut and paste his code whenver the subject comes up). The idea is to generate a list of ascending integers, join the source data by range against the numbers, and then count the number of distinct numbers, as follows. (This code uses syntax from SQL Server 2008, but with minor modifications would work in SQL 2005.)

First set up some testing data:

CREATE TABLE #EventTable
 (
   PersonId   int  not null
  ,startDate  datetime  not null
  ,endDate    datetime  not null
 )

INSERT #EventTable
 values (1, 'Jan 1, 2011', 'Jan 4, 2011')
       ,(1, 'Jan 3, 2011', 'Jan 5, 2011')
       ,(2, 'Jan 1, 2011', 'Jan 3, 2011')
       ,(2, 'Jan 6, 2011', 'Jan 9, 2011')

Determine some initial values

DECLARE @Interval bigint ,@FirstDay datetime ,@PersonId int = 1 -- (or whatever)

Get the first day and the maximum possible number of dates (to keep the cte from generating extra values):

SELECT
   @Interval = datediff(dd, min(startDate), max(endDate)) + 1
  ,@FirstDay = min(startDate)
 from #EventTable
 where PersonId = @PersonId

Cut and paste over the one routine and modify and test it to only return as many integers as we'll need:

/*
;WITH
  Pass0 as (select 1 as C union all select 1), --2 rows
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),--4,294,967,296 rows
  Tally as (select row_number() over(order by C) as Number from Pass5)
 select Number from Tally where Number <= @Interval
*/

And now revise it by first joining to the intervals defined in each source row, and then count each distinct value found:

;WITH
  Pass0 as (select 1 as C union all select 1), --2 rows
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),--4,294,967,296 rows
  Tally as (select row_number() over(order by C) as Number from Pass5)
SELECT PersonId, count(distinct Number) EventDays
 from #EventTable et
  inner join Tally
   on dateadd(dd, Tally.Number - 1, @FirstDay) between et.startDate and et.endDate
 where et.PersonId = @PersonId
  and Number <= @Interval
 group by PersonId

Take out the @PersonId filter and you'd get it for all persons. And with minor modification you can do it for any time interval, not just days (which is why I set the Tally table to generate severely large numbers.)

Philip Kelley
  • 39,426
  • 11
  • 57
  • 92
1

Try something like this

select 
    personId, 
    sum(DateDuration) as TotalDuration
from
(
    select personId, datediff(dd, startDate, endDate) as DateDuration
    from yourEventTable
) a
group by personId
  • if theres a aps - it wont give you – Royi Namir Nov 04 '11 at 20:53
  • @Shark: on the scenario 2 that would result in datediff(dd, 1, 9) = 8 instead of datediff(dd,1,3) + datediff(dd,6,9) = 2 + 3 = 5 – dcarneiro Nov 04 '11 at 20:55
  • @Daniel oh, I see. I thought you wanted the total duration. Let me modify accordingly. –  Nov 04 '11 at 20:57
  • @Shark, sadly no. On the first screnario that would result on datediff(1,4) + datediff(3,5) = 3 + 2 = 5 but he was on events from day one to day 5 so it should be datediff(1,5) = 4. I need to count the repeated days only once – dcarneiro Nov 04 '11 at 21:03
  • @Daniel ok I need to re-read your question then. I'm a bit confused :) I'll let you know when it is modified. –  Nov 04 '11 at 21:04
  • @Shark read my answer : to him : your example is bad since you have overlapping numbers which cant be possible : a worker started at 1 till 4 and then you say from 3(overlapping) to 5... my example is : 1-3 and 3-5 which is 2+2=4 as much as 1-5 = 4. – Royi Namir Nov 04 '11 at 21:20
1

Algebra. If B-n is the ending time of the nth event, and A-n is the starting time of the nth event, then the sum of the differences is the difference of the sums. So you can write

select everything else, sum(cast(endDate as int)) - sum(cast(startDate as int)) as daysSpent

If your dates have no time component, this works. Otherwise, you could use a real.

drdwilcox
  • 3,833
  • 17
  • 19
  • on the scenario 1 that would result on sum(enddate) - sum(startDate) = (4 + 5) - (1 + 3) = 9 - 4 = 5 but he was on events from day one to day 5: 5 - 1 = 4. I need to count the repeated days only once – dcarneiro Nov 04 '11 at 20:58
  • It's worse than that. I wasn't counting the end day either. I'm a C programmer, the end is never part of the interval :-) – drdwilcox Nov 04 '11 at 21:02
  • In C it's a little easy but i still can't find how to do it in C without a loop and a lot of conditions to see if they overlap. About the end part, you can use events (1,7) and (4,9), this should merge on a timestamp from (1,9) – dcarneiro Nov 04 '11 at 21:06
  • I agree that it is not going to be trivial in any programming language. You are going to have to compare each record pair-wise to find overlap. I'm pretty sure there isn't a solution that isn't O(n^2) – drdwilcox Nov 04 '11 at 21:12
1
;WITH cte(gap)
AS
(
    SELECT sum(b-a) from xxx GROUP BY uid
)

SELECT * FROM cte
Royi Namir
  • 144,742
  • 138
  • 468
  • 792
  • I didn't knew about common table expression on SQL Server. I'm not sure but i think the results this query will give for the first scenario is 5: sum(4-1,5-3) => sum(3,2) => 5 but he was on events from day 1 to day 5. It should return 5-1 => 4. But i'll try with CTE tomorrow on the office PC. Thanks – dcarneiro Nov 04 '11 at 21:12
  • @Daniel your example is **bad** since you have overlapping numbers which cant be possible : a worker started at 1 till 4 and then you say from 3(overlapping) to 5... my example is : 1-3 and 3-5 which is 2+2=4 as much as 1-5 = 4. – Royi Namir Nov 04 '11 at 21:19
  • The overlapping is possible, but lets change scenario 1 to event1(1,7) and event2(4,9) to create less ambiguity. The result should be the difference between 1 and 9 since he was occupied on that dates. – dcarneiro Nov 04 '11 at 21:25
-1

Edit 1: I have modified both solutions to get correct results.

Edit 2: I have done comparative tests using the solutions proposed by Mikael Eriksson, Conrad Frix, Philip Kelley and me. All tests use an EventTable with the following structure:

CREATE TABLE EventTable
(
     EventID    INT IDENTITY PRIMARY KEY
    ,PersonId   INT NOT NULL
    ,StartDate  DATETIME NOT NULL
    ,EndDate    DATETIME NOT NULL
    ,CONSTRAINT CK_StartDate_Before_EndDate CHECK(StartDate < EndDate)
);

Also, all tests use warm buffer (no DBCC DROPCLEANBUFFERS) and cold [plan] cache (I have executed DBCC FREEPROCCACHE before every test). Because some solutions use a filter(PersonId = 1) and others not, I have inserted into EventTable rows for only one person (INSERT ...(PersonId,...) VALUES (1,...)).

These are the results: enter image description here

My solutions use recursive CTEs.

Solution 1:

WITH BaseCTE
AS
(
    SELECT   e.StartDate
            ,e.EndDate
            ,e.PersonId
            ,ROW_NUMBER() OVER(PARTITION BY e.PersonId ORDER BY e.StartDate, e.EndDate) RowNumber
    FROM    EventTable e
),  RecursiveCTE
AS
(
    SELECT   b.PersonId
            ,b.RowNumber

            ,b.StartDate
            ,b.EndDate
            ,b.EndDate AS MaxEndDate
            ,1 AS PseudoDenseRank
    FROM    BaseCTE b
    WHERE   b.RowNumber = 1
    UNION ALL
    SELECT   crt.PersonId
            ,crt.RowNumber

            ,crt.StartDate
            ,crt.EndDate
            ,CASE WHEN crt.EndDate > prev.MaxEndDate THEN crt.EndDate ELSE prev.MaxEndDate END
            ,CASE WHEN crt.StartDate <= prev.MaxEndDate THEN prev.PseudoDenseRank ELSE prev.PseudoDenseRank + 1 END
    FROM    RecursiveCTE prev
    INNER JOIN BaseCTE crt ON prev.PersonId = crt.PersonId
    AND     prev.RowNumber + 1 = crt.RowNumber
),  SumDaysPerPersonAndInterval
AS
(
    SELECT   src.PersonId
            ,src.PseudoDenseRank --Interval ID
            ,DATEDIFF(DAY, MIN(src.StartDate), MAX(src.EndDate)) Days
    FROM    RecursiveCTE src
    GROUP BY src.PersonId, src.PseudoDenseRank
)
SELECT  x.PersonId, SUM( x.Days ) DaysPerPerson
FROM    SumDaysPerPersonAndInterval x
GROUP BY x.PersonId
OPTION(MAXRECURSION 32767);

Solution 2:

DECLARE @Base TABLE --or a temporary table: CREATE TABLE #Base (...) 
(
     PersonID   INT NOT NULL
    ,StartDate  DATETIME NOT NULL
    ,EndDate    DATETIME NOT NULL
    ,RowNumber  INT NOT NULL
    ,PRIMARY KEY(PersonID, RowNumber)
);
INSERT  @Base (PersonID, StartDate, EndDate, RowNumber)
SELECT   e.PersonId
        ,e.StartDate
        ,e.EndDate
        ,ROW_NUMBER() OVER(PARTITION BY e.PersonID ORDER BY e.StartDate, e.EndDate) RowNumber
FROM    EventTable e;

WITH RecursiveCTE
AS
(
    SELECT   b.PersonId
            ,b.RowNumber

            ,b.StartDate
            ,b.EndDate
            ,b.EndDate AS MaxEndDate
            ,1 AS PseudoDenseRank
    FROM    @Base b
    WHERE   b.RowNumber = 1
    UNION ALL
    SELECT   crt.PersonId
            ,crt.RowNumber

            ,crt.StartDate
            ,crt.EndDate
            ,CASE WHEN crt.EndDate > prev.MaxEndDate THEN crt.EndDate ELSE prev.MaxEndDate END
            ,CASE WHEN crt.StartDate <= prev.MaxEndDate THEN prev.PseudoDenseRank ELSE prev.PseudoDenseRank + 1 END
    FROM    RecursiveCTE prev
    INNER JOIN @Base crt ON prev.PersonId = crt.PersonId
    AND     prev.RowNumber + 1 = crt.RowNumber
),  SumDaysPerPersonAndInterval
AS
(
    SELECT   src.PersonId
            ,src.PseudoDenseRank --Interval ID
            ,DATEDIFF(DAY, MIN(src.StartDate), MAX(src.EndDate)) Days
    FROM    RecursiveCTE src
    GROUP BY src.PersonId, src.PseudoDenseRank
)
SELECT  x.PersonId, SUM( x.Days ) DaysPerPerson
FROM    SumDaysPerPersonAndInterval x
GROUP BY x.PersonId
OPTION(MAXRECURSION 32767);
Bogdan Sahlean
  • 19,233
  • 3
  • 42
  • 57