Create an SQL query which returns date-dependent aggregate counts for distinct individuals and cumulative count of events in a table

Question

--code to create a minimally reproduceable example
CREATE TABLE mre (
recid int,
caseID int,
t1 date,
t2 date,
t3 date,);
go    
INSERT INTO mre (recid, caseID, t1, t2, t3)
VALUES (1, 1,'2018-04-01','2019-05-05','2019-05-05'),
(2, 2,'2006-10-12','2007-03-10',NULL),
(3, 3,'2018-05-05','2019-08-07','2019-07-09'),
(4, 4,'2005-07-01','2020-10-10',NULL),  
(5, 5,'2018-08-25','2022-01-01','2021-12-12'),
(6, 5,'2018-08-25','2022-01-01','2021-12-12'),
(7, 6,'2020-09-20','2022-01-13','2022-01-04')
go
--here is the record set
select*
from mre
go
--create a table mimicking the desired query output 
create table qryout (t3 date, perscount int, eventcount int)
go
insert into qryout (t3,perscount,eventcount )
VALUES('2019-05-05',4, 1),
('2019-07-09',3, 2),
('2021-12-12',2, 4),
('2022-01-04',1, 5)
--qry output should look like this 
select*
from qryout

In the minimally reproduceable example, I have a record set with seven records, each with three time points. ‘recid’ is the unique record identifier, ‘caseid’ uniquely identifies an individual. ‘t1’ is the time someone entered a population, ‘t2’ is the time someone left a population and ‘t3’ is the time of an event occurring involving a person. If an event did not occur ‘t3’ contains no data.

For each event date, I want to know how many distinct individuals were in the total population at the time by counting those who had already joined by t3 (at or before t3) and had not left before t3, though they could leave on t3 and still be counted. Note that person number 5 happened to have two events on the same date. This person should not be counted twice in the number of people that were around at this time point.

So for the first instance of t3, involving person number 1, people 1,2,3 & 4 were in the sample at the time. Person 6 joined too late to be counted and person 2 had already left, so a total of four people were around when this event occurred.

For the second instance, subjects 3,4, & 5 were in the sample, a total of 3, and for the remaining two time points there were 2 and 1 respectively.

Also, at each time point I would like to know how many events had occurred so far. I can have multiple events occurring on the same day for the same person as stated above. So, the events involving person 5 need to be both counted in the total number of events.

I need a query which can output this summary. Computational efficiency is a consideration, as the datasets involved are fairly large (3-4M observations). I feel this may be a job for a count within a case statement, but have been unable to operationalise this. Currently I'm doing this in SQL server 2008. I am migrating the project to an SQL 2019 environment in the near future, so if there is a better solution using this version I'd be interested in looking at this as well. Any help gratefully received.

score 1 · Answer 1 · answered Feb 17 '22 at 15:15

The following query should work for you:

Unpivot the three date columns, conditionally adding t3 only where it is not null, and add a column showing how much the population changes.
Group it by the unpivoted date.
- Sum the difference, and take a running sum of that (a window function over the normal aggregation).
- Count also the number of rows which have 0 diff, in other words an event has occured.
Filter on only the rows where there is at least one event.
Take also a running sum of the number of events.
Note that this query only requires a single scan of the base table, and a single sort.

WITH Population AS (
    SELECT
      v.EventDate,
      DiffQty = SUM(v.DiffQty),
      CumeSum = SUM(SUM(v.DiffQty)) OVER (ORDER BY v.EventDate ROWS UNBOUNDED PRECEDING),
      EventsCount = COUNT(CASE WHEN DiffQty = 0 THEN 1 END)
    FROM mre
    CROSS APPLY (
        SELECT t1, 1
        UNION ALL
        SELECT t2, -1
        UNION ALL
        SELECT t3, 0
        WHERE t3 IS NOT NULL
    ) v(EventDate, DiffQty)
    GROUP BY
      v.EventDate
)
SELECT
  EventDate,
  PersonCount = CumeSum,
  CumeEventsCount = SUM(EventsCount) OVER (ORDER BY EventDate ROWS UNBOUNDED PRECEDING)
FROM Population
WHERE EventsCount > 0;

db<>fiddle

This only works in SQL Server 2012 and later. In earlier versions (which I strongly recommend you migrate away from) you need to use one of these hacks.

Apologies for not getting back on this post. I needed to engineer an emergency solution in SQL because I could not get an R based solution to work, however I have solved that now. As this is a survival problem, it was better to solve it using a survival package. I was unable to try your solution, which looks very helpful, as I've not migrated the original dataset from SQL2008 yet. From the link you provided, I guess I'd need some sort of solution based on cursors? — gavinr, Mar 10 '22 at 13:48
Either that or a triangular join or SQLCLR. The first two of these are likely to be very slow, it may be faster to export the data and use a procedural language, or import it into a newer version of SQL Server (the Express and Developer Editions are free) — Charlieface, Mar 10 '22 at 13:51

Create an SQL query which returns date-dependent aggregate counts for distinct individuals and cumulative count of events in a table

1 Answers1