T-SQL Fewest Sets of Common Dates that Includes All Row IDs

Question

My table (@MyTable) is a list of IDs with start dates and end dates (inclusive) that represent an interval of days when the ID appears in a file that is received once per day:

ID    Start_Date    End_Date
1     10/01/2014    12/15/2014
2     11/05/2014    03/03/2015
3     12/07/2014    12/09/2014
4     04/01/2015    04/15/2015

Each ID appears only once, i.e. only has 1 associated time interval, and intervals between Start_Dates and End_dates can (but not necessarily) overlap across different IDs. I need a SQL query to find the sets of dates where each ID will appear at least once when the files from these sets of dates are merged, in the smallest number of dates as possible. In the table above the solution could be these 2 dates:

File_Date     ID(s)
12/07/2015    1,2,3
04/01/2015    4

But for the example any 1 date between ID(3)'s Start_date and End_date & combined with 1 date between ID(4)'s Start_date and End_date would be a solution.

The actual data consists of 10,000 different IDs. The date range of possible file dates is 04/01/2014 - 07/01/2015. Each daily file is very large in size and must be downloaded manually, hence I want to minimize the number I must download to include all IDs.

So far I have a CTE that results in separate rows for all dates between the Start_Date and End_date of each ID:

;WITH cte (ID, d)
AS
(
    SELECT 
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM @MyTable tbl
    UNION ALL
    SELECT 
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN 
    @MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date 
FROM cte
ORDER BY ID,d
OPTION (MaxRecursion 500)

Using @MyTable example results are:

ID    File_Date
1     10/01/2014
1     10/02/2014
1     10/03/2014
1     etc...

My thinking was to determine the most common File_Date among all the IDs, then pick the next most common File_Date among all the IDs left, and so on...but I'm stuck. To put it in more mathy terms, I am trying to find the fewest sets (File_Dates) that contain all the items (IDs), similar to https://softwareengineering.stackexchange.com/questions/263095/finding-the-fewest-sets-which-contain-all-items, but I don't care about minimizing duplicates. The final results do not have to include which IDs appear in which File_Dates; I just need to know all the File_Dates.

I'm using MS SQL Server 2008.

This type of optimization is not something that SQL is well-suited for. I think this might be variant of a shortest path problem. — Gordon Linoff, Jul 16 '15 at 20:55
Unfortunately that's the only tool I have at hand...I don't need a perfect solution. I can brute force the last 10% if needed. — RCheskin, Jul 16 '15 at 21:36

score 0 · Accepted Answer · answered Jul 17 '15 at 01:58

Just go on with what you started. The result found by this method is not optimal, but could be good enough for your purposes.

For each ID generate a set of rows for each day in the range. You already know how to do it, though I'd use a table of numbers for it, rather than generating it on the fly with CTE every time, but it doesn't really matter.

Put result into a temporary table. It will have 10,000 IDs * ~400 days = ~4M rows. The temp table has two columns (ID, FileDate). Create appropriate indexes. I'd start with two: on (ID, FileDate) and on (FileDate, ID). Make one of them clustered and primary key. I'd try to make (FileDate, ID) as clustered primary key.

Then process in a loop:

Find a date that has most number of IDs:

SELECT TOP(1) @VarDate = FileDate
FROM #temp
GROUP BY FileDate
ORDER BY COUNT(*) DESC;

Remember found date (and optionally its IDs) in another temp table for the final result.

Delete date and IDs that correspond to this date from the big table.

DELETE FROM #temp
WHERE FileDate = @VarDate
OR ID IN
(
    SELECT t2.ID
    FROM #temp AS t2
    WHERE t2.FileDate = @VarDate
)

Repeat the loop until there are no rows in #temp.

score 0 · Answer 2 · edited May 23 '17 at 11:46

Using Vladimir B.'s suggested approach and the answer from In SQL Server, how to create while loop in select as a model:

;WITH cte (ID, d)
AS
(
    SELECT 
        tbl.ID AS ID,
        tbl.Start_Date AS d
    FROM @MyTable tbl
    UNION ALL
    SELECT 
        tbl.ID AS ID,
        DATEADD(DAY, 1, cte.d) AS d
    FROM cte
    INNER JOIN 
    @MyTable tbl ON cte.ID = tbl.ID
    WHERE cte.d < tbl.End_Date
)
SELECT
    ID AS ID,
    d AS File_Date
    into #temp2
FROM cte
ORDER BY ID,d
OPTION (MaxRecursion 500)

Create Table #FileDates
(
File_Date date
)

GO

DECLARE @VarDate date

WHILE EXISTS (select * from #temp2)

BEGIN

SELECT TOP(1) 
@VarDate = File_Date
FROM #temp2
GROUP BY File_Date
ORDER BY COUNT(*) DESC;

INSERT INTO #FileDates (File_Date)
Values (@VarDate)

DELETE from #temp2
WHERE File_Date=@VarDate
OR ID in
(
    select t2.ID
    from #temp2 as t2
    where t2.File_Date = @VarDate
)

END

SELECT *
FROM #FileDates
ORDER BY File_Date

Took 30 seconds to return 40 file dates for approx. 4,000 IDs. Thank you very much Mr. Baranov!

T-SQL Fewest Sets of Common Dates that Includes All Row IDs

2 Answers2