Sum unique day count within a date ranges

Question

I have a list of duplicate date ranges where I need to sum all dates unique to a campaign dates, without duplicating overlapping days.

For example in the data below, there are days in campaign_a running between 08/07/2022 and 15/07/2022 which overlap the 11/07/2022 and 13/07/2022 which need to not duplicate when summing.

The summing also needs to take the campaign name as a conditional.

campaign	days	line_start	line_end
campaign_a	108	14/07/2022	30/10/2022
campaign_a	61	31/10/2022	31/12/2022
campaign_a	2	11/07/2022	13/07/2022
campaign_a	2	8/07/2022	15/07/2022
campaign_a	108	14/07/2022	30/10/2022
campaign_a	61	31/10/2022	31/12/2022
campaign_a	2	11/07/2022	13/07/2022
campaign_a	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022

please update your question (and data) to explain overlapping dates in the same campaign to count as one. It's nowhere in your question and Scott Craner's answer answers your question as asked. Jos Wooley managed to answer your question as you intended it solely by your comments underneath Scott's answer. The problem you encountered is a challenging problem. If you update your question it would become a question worth upvoting. — P.b, Nov 08 '22 at 11:48

score 3 · Accepted Answer · edited Nov 09 '22 at 20:40

Again, assuming "campaign_a", for example, in G2:

=LET(
    δ, A$2:D$21,
    ζ, FILTER(δ, INDEX(δ, , 1) = G2),
    α, INDEX(ζ, , 3),
    β, INDEX(ζ, , 4),
    ξ, SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α)),
    γ, BYROW(ξ, LAMBDA(λ, SUM(BYROW(CHOOSE({1, 2}, α, β), 
       LAMBDA(κ, N(MEDIAN(κ, λ) = λ)))))),
       COUNT(FILTER(ξ, γ > 0))
)

Copy down to give similar results for campaigns in G3, G4, etc.

Explanation

The best way to understand this approach is to consider a simple example like the following one using numbers:

Campaing	Start	End
campaing_a	1	1
campaing_a	1	3
campaing_a	5	5
campaing_a	5	5
campaing_a	6	6

What it does is build a grid with a timeframe from 1 to 6. This is the SEQUENCE part of the formula (it is represented in the D1:H1 range see next screenshot):

 SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α))

The MEDIAN calculation is a way of filling the grid (schedule) with 1 or 0 values. It ensures that from the start to the end of each row is filled with 1's. The calculation is as follows:

MEDIAN(start, end, counter)

Let's say for the interval [2,4] we can check the following output based on different counters (from 1 to 5):

MEDIAN(2,4,1) -> 2 <> 1 (out of the interval)
MEDIAN(2,4,2) -> 2 =  2
MEDIAN(2,4,3) -> 3 =  3
MEDIAN(2,4,4) -> 4 =  4 
MEDIAN(2,4,5) -> 4 <> 5 (out of the interval)

So if counter is within the range of [start, end], then the MEDIAN returns counter. Therefore:

MEDIAN(start, end, counter) = counter

is a way to generate 1 for the entire period from start to end in the grid, otherwise returns 0.

Note: It is important to consider that this solution assumes a non-empty start or end value, otherwise it gives a wrong result, because MEDIAN(B1, C1, counter) if B1 and C1 are empty it returns counter.

Here is the graphical representation of the process described before:

The SUM on each column counts the number of overlaps per column (per day). Because we don't want to count the overlaps as part of the duration, we just need to count if the sum is bigger than 0. For example:

=FILTER(D1:I1,D7:I7>0) -> [1,2,3,5,6]

so the total duration is 5 (counting the number of items of FILTER output)

The rest is to put this logic in Excel in a way you can iterate over all elements.

Note: If the range of the grid is too wide, let's say you have three items:

Item   Start End
item1  1     2
item2  3     4
item3  5     1000

it generates a grid from 1-1000 elements just for calculating the total duration of three items, so take this into consideration that for a large data set with such characteristics it may impact the performance.

Very clever using MEDIAN to compare inside lambda. I thought about using the SUMPRODUCT of (sequence of days from min to max > line_start and the sequence 0, but I like this approach. — P.b, Nov 08 '22 at 08:23
`=LET(unq,UNIQUE(Table1[campaign]), ucount,BYROW(unq,LAMBDA(u,LET( data, Table1, campaign, FILTER(data, INDEX(data, , 1) = u), start, INDEX(campaign, , 3), end, INDEX(campaign, , 4), seq, SEQUENCE(MAX(end) - MIN(start) + 1, , MIN(start)), br, BYROW(seq, LAMBDA(b, SUM(BYROW(CHOOSE({1,2}, start, end), LAMBDA(a, N(MEDIAN(a, b) = b)))))), COUNT(FILTER(seq, br > 0)) ))),HSTACK(unq,ucount)) ` this would spill the data, without dragging — P.b, Nov 08 '22 at 11:40
@JosWoolley great solution!, at first I didn't understand this, but after taking a closer look I was able to understand it. I updated your answer to provide an explanation. It is a *very beautiful solution and I think it deserves to be understood* so I added an explanation of how it works. If you do not agree with the edit, please roll it back, or just make the changes you consider — David Leal, Nov 09 '22 at 17:06
Thank you both for the solution and the explanation I was able to repurpose it to fit a slightly different layout in my sheet through trial an error, having the visual explanation definitely helps understand what’s going on here. — Jonathan Devereux, Nov 09 '22 at 19:17
@DavidLeal Of course I agree! Fantastic explanation, and I'm very grateful to you indeed for taking the time to provide such a detailed breakdown. — Jos Woolley, Nov 09 '22 at 20:26

David Leal · Answer 2 · 2022-11-09T05:28:27.503

Basically, this problem is similar to calculating the project's total duration (in days). Where in this case each campaign's data can be assimilated as project tasks with start and end dates. The only difference here is that gaps don't contribute to project duration.

Let's illustrate it with the following example, since dates are represented in Excel as a positive integer, to simplify the calculation we can use just numbers as follow (Sample Data 1):

From the graph is easy to identify that the total duration will be 13 days. Notice that the gap in columns K and P doesn't count for the total duration.

Based on this graphical representation we can build a logic for calculating the durations as we progress from the top of the Gantt chart to the bottom.

The logic to calculate the duration is based on identifying overlapping duration for each task. The condition to identify two intervals A,B overlap is the following:

AND(startA <= endB, endA >= startB)

we are going to iterate over all tasks and on each step update the new start and end dates and update the duration. Let's say we are going to calculate the new start and end dates for two consecutive rows represented by A and B intervals. The combined interval will be:

[MIN(startA, startB), MAX(endA, endB)]

If the previous duration is represented by the variable duration, then the updated duration will be:

IF(endB > endA, duration + (endB - endA), duration)))

i.e, if the next interval (B) ends after A, then we need to update the duration with the difference (endB - endA).

If the intervals don't overlap, then the duration will be updated as follow:

duration = duration + endB - startB + 1

and the next start and end date will be the startB, endB for the next calculation.

The above process can now be put in Excel terms as follow:

=LET(set, $A$2:$C$12, campaing, T2, campaings, INDEX(set,,1),
 subset, SORT(FILTER(set, campaings=campaing),2),starts, INDEX(subset,,2),
 ends, INDEX(subset,,3),intervals, starts&","&ends,
 SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,",")), JOIN, LAMBDA(y, TEXTJOIN(",",,y)),
 calc, REDUCE("0,0,0", intervals, LAMBDA(acc,i, LET(accValues, SPLIT(acc),
   startA, INDEX(accValues,1,1), endA, INDEX(accValues,1,2),
   duration, INDEX(accValues,1,3), values, SPLIT(i), startB, INDEX(values,1,1), 
   endB, INDEX(values,1,2),
   IF(duration=0,JOIN(HSTACK(startB, endB, endB-startB+1)),
    IF(AND(startA <= endB, endA >= startB),
      JOIN(HSTACK(MIN(startA, startB), MAX(endA, endB),
        IF(endB > endA, duration + (endB-endA), duration))),
      JOIN(HSTACK(startB, endB, duration + endB-startB+1))
    ))))), INDEX(SPLIT(calc),1,3)
)

and here is the output:

Back to the original input data from the question for both campaigns, the output will be:

Formula Explanation

Now that we know the logic to calculate the total duration, let's explain the excel implementation.

We use LET function to define the required variables and to avoid repeating the same calculation.

In order to be able to build the logic explained before, we need to sort the input data based on the start date.

For the calculation, we need to use an Excel function that allows using a cumulative calculation based on the previous iteration. We can use it for that REDUCE or SCAN. We use REDUCE because at the end we need a scalar number with the total, not an array, but SCAN is a good option for debugging purposes. In order to do that you can replace REPLACE with SCAN and the following line at the end: INDEX(SPLIT(calc),,3) with: calc.

For Sample Data 1 this will be SCAN output, so you can check in each iteration the CVS values stored in the accumulator (start, end, duration)

1,2,2
1,4,4
1,4,4
5,6,6
7,7,7
9,10,9
11,11,10
11,11,10
12,12,11
14,14,12
14,15,13

Because on every iteration, we need to track three values: start date, end date, and duration, but REDUCE function iterates over a single element of the input array. We are going to build a record in string format via Comma-separated-value format (CSV) and then extract the information back in every iteration. In order to do that we define the following two user LAMBDA functions:

SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,","))
JOIN, LAMBDA(y, TEXTJOIN(",",,y))

Notice because we are dealing with numbers we ensure the SPLIT function returns numbers, not texts as it is the output of TEXTSPLIT. We do the conversion by multiplying the output of TEXTSPLIT by 1.

The accumulator (acc) was initialized as all values equal to zero (0,0,0), so for the first iteration we just assign the information from the first row (start, end, and duration), then we iterate following the logic explained before and use variables to extract back the information stored in CSV format.

Array Version

Here is the formula for the array version, that returns the unique days for all the campaigns:

=LET(set, $W$2:$Y$21, campaings, INDEX(set,,1), campaingsUx, UNIQUE(campaings),
 SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,",")), JOIN, LAMBDA(y, TEXTJOIN(",",,y)),
 byRowResult, BYROW(campaingsUx, LAMBDA(campaing, LET(subset, 
  SORT(FILTER(set, campaings=campaing),2),starts, INDEX(subset,,2),
  ends, INDEX(subset,,3),intervals, starts&","&ends,
  calc, REDUCE("0,0,0", intervals, LAMBDA(acc,i, LET(accValues, SPLIT(acc),
    startA, INDEX(accValues,1,1), endA, INDEX(accValues,1,2),
    duration, INDEX(accValues,1,3), values, SPLIT(i),
    startB, INDEX(values,1,1),  endB, INDEX(values,1,2),
    IF(duration=0,JOIN(HSTACK(startB, endB, endB-startB+1)),
     IF(AND(startA <= endB, endA >= startB),
       JOIN(HSTACK(MIN(startA, startB), MAX(endA, endB),
         IF(endB > endA, duration + (endB-endA), duration))),
       JOIN(HSTACK(startB, endB, duration + endB-startB+1))))
  ))), INDEX(SPLIT(calc),1,3)
 ))), HSTACK(campaingsUx, byRowResult)
)

And the output will be:

campaign_a  177
campaign_b  177

Sum unique day count within a date ranges

2 Answers2

Explanation

Formula Explanation

Array Version

Linked