Cumulatively Count Gaps in Sequential Numbers Results in Different Answers When New Data Added

Question

I asked a question a few days ago which you guys helped me solve and I am forever grateful! However, a new issue has presented itself and I am in need of your help once again!

Here's a link to the original problem: (R) Cumulatively Count Gaps in Sequential Numbers

I was trying to cumulatively counts gaps in sequential numbers for each selection of UniqueIDs. This was my dataset:

UniqueID  Month  
ABC123    1       
ABC123    2      
ABC123    3      
ABC123    4      
ABC123    6      
ABC123    7      
DEF456    3      
DEF456    4      
DEF456    10     
DEF456    11     
DEF456    12     
DEF456    14     
GHI789    2      
GHI789    3  
JKL012    12     
JKL012    13     
JKL012    14

Using your help, I tweaked the code provided from the link above as follows:

data2=data %>%
       group_by(UniqueID) %>%
       mutate(Skip = if_else(Month - lag(Month, default = first(Month) - 1) - 1 > 0, 1, 0),
       CountSkip = cumsum(Skip))

data2 = data2%>% 
       group_by(UniqueID) %>%
       mutate(LastValue = if_else(Month == last(Month), 1, 0))

data2=as.data.frame(data2)
data2$FinalTally=ifelse(data2$LastValue==1 & data2$Month!=14,1,0)
data2$SeqCount=data2$FinalTally+data2$CountSkip

This was the resulting dataset:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ABC123    1      0    0         0          0            0
ABC123    2      0    0         0          0            0
ABC123    3      0    0         0          0            0 
ABC123    4      0    0         0          0            0
ABC123    6      1    1         0          0            1
ABC123    7      1    2         1          1            2
DEF456    3      0    0         0          0            0
DEF456    4      0    0         0          0            0
DEF456    10     1    1         0          0            1
DEF456    11     1    1         0          0            1
DEF456    12     1    1         0          0            1  
DEF456    14     2    2         1          0            2
GHI789    2      0    0         0          0            0
GHI789    3      0    1         1          1            1
JKL012    12     0    0         0          0            0
JKL012    13     0    0         0          0            0 
JKL012    14     0    0         1          0            0

This is what I wanted...or so I thought.

When adding in new data for the next month (15), I edited the second to last line of my code to account for 15 being the new final month. However, I noticed the sum of SeqCount by Month differed from the sum of that same month before the new data was added. I filtered down to one month and found an example of one UniqueID where the SeqCount sum had differed.

Here is an example before the new data was included:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ZZZ999    2      0    0         0          0            0
ZZZ999    3      0    0         0          0            0
ZZZ999    4      0    0         0          0            0 
ZZZ999    5      0    0         0          0            0
ZZZ999    6      0    0         1          1            1

Here is the example when the new data was included:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ZZZ999    2      0    0         0          0            0
ZZZ999    3      0    0         0          0            0
ZZZ999    4      0    0         0          0            0 
ZZZ999    5      0    0         0          0            0
ZZZ999    6      0    0         0          0            0
ZZZ999    15     1    1         1          0            1

This is the problem: Month 6 loses a value of SeqCount when new data is added in.

My ultimate goal is to run a regression model for each month with SeqCount as the response with some other columns as predictors (I didn't include them for ease of reading). Whenever I add new data in, the response will change and my estimates will not be consistent.

Is there a way I can structure my code differently so when I add new data, the logic of the code does not change the information from previous values of SeqCount?

Any help would be appreciated!

Thank you!

SeqCount is based in part on FinalTally, which is based in part on LastValue, which is 1 for the ID's last row. When you add more data, the LastValue goes to the new row and no longer occurs in the old last row. — Jon Spring, Feb 01 '20 at 23:55
In other words, your code has a specific rule in it which "changes the information from previous values of SeqCount." I'm not sure this is solveable unless you can describe what you want that is different from that. — Jon Spring, Feb 02 '20 at 18:33
What I want to be able to do is to count how many times a month is skipped per UniqueID and have a running total that does not change - so it seems like the logic of the last value calculation screws that up (as you pointed out). Is there another way I can map this out? — user2813606, Feb 03 '20 at 17:19
What I don't understand is that it seems your first step does that already. The trouble for me is in the 2nd step where you seem to want to both detect the last row and also have it not change when there are more rows. Those seem like incompatible goals. — Jon Spring, Feb 03 '20 at 18:10
For the first part, you might more concisely use: `data %>% group_by(UniqueID) %>% mutate(skips_so_far = cumsum(Month > lag(Month, default = 0) + 1)) %>% ungroup()` — Jon Spring, Feb 03 '20 at 18:10
If I use the code as it is on the data you have shared I don't get the values you have shown in resulting dataset. Are you sure you have the correct values? Can you make your input clear (keep it only one instead of new and old) and show expected output for it. — Ronak Shah, Feb 04 '20 at 10:56

Patrick · Accepted Answer · 2020-02-04T22:57:24.417

The following seems to reproduce what you want without hardcoding any values into the logic.

Note - as others have commented, there appears to be a discrepancy between the results in the question and the code provided. For example, in the question post, UniqueID GHI789 for Month 3 has a CountSkip value of 1 though the code returns 0. The code in this answer returns 0.

data <- cbind.data.frame(UniqueID = c('ABC123','ABC123','ABC123','ABC123','ABC123','ABC123','DEF456','DEF456','DEF456','DEF456','DEF456','DEF456','GHI789','GHI789','JKL012','JKL012','JKL012'),
                         Month = c(1,2,3,4,6,7,3,4,10,11,12,14,2,3,12,13,14))

cartesian <- expand.grid(UniqueID = unique(as.character(data$UniqueID)),
                  Month = seq(from=min(data$Month), to=max(data$Month), by=1))

BA <- cartesian %>% 
  left_join(data %>% mutate(Month_orig=Month), by=c("UniqueID","Month")) %>% 
  arrange(UniqueID, Month) %>% 
  group_by(UniqueID) %>% 
  mutate(Skip = ifelse(Month==Month_orig & 
                         is.na(lag(Month_orig,1)) & 
                         Month!=min(Month[!is.na(Month_orig)]), 
                       1, 0)) %>%
  mutate(Skip = ifelse(Skip==1 & is.na(lag(Month_orig,2)), 0, Skip)) %>% # contstrain to only one skipped period?
  filter(!is.na(Month_orig)) %>% 
  mutate(CountSkip = cumsum(Skip)) %>% 
  mutate(LastValue = ifelse(Month==max(Month), 1, 0)) %>%
  mutate(FinalTally = ifelse(LastValue==1 & Month != max(Month),1,0)) %>% 
  mutate(SeqCount = FinalTally + CountSkip) %>% 
  select(-Month_orig)

BA
# A tibble: 17 x 7
# Groups:   UniqueID [4]
   UniqueID Month  Skip CountSkip LastValue FinalTally SeqCount
   <fct>    <dbl> <dbl>     <dbl>     <dbl>      <dbl>    <dbl>
 1 ABC123       1     0         0         0          0        0
 2 ABC123       2     0         0         0          0        0
 3 ABC123       3     0         0         0          0        0
 4 ABC123       4     0         0         0          0        0
 5 ABC123       6     1         1         0          0        1
 6 ABC123       7     0         1         1          0        1
 7 DEF456       3     0         0         0          0        0
 8 DEF456       4     0         0         0          0        0
 9 DEF456      10     0         0         0          0        0
10 DEF456      11     0         0         0          0        0
11 DEF456      12     0         0         0          0        0
12 DEF456      14     1         1         1          0        1
13 GHI789       2     0         0         0          0        0
14 GHI789       3     0         0         1          0        0
15 JKL012      12     0         0         0          0        0
16 JKL012      13     0         0         0          0        0
17 JKL012      14     0         0         1          0        0

In reading the question and comments more closely - I agree that a reference to the previous version of the table would be required to implement a solution. You might try rbind(old_data,new_rows) where new_rows is the result of processing all data and keeping only the unique(UniqueID, Month) not contained in the old_data. I don't fully follow the logic or intent here so perhaps that is not a good idea.

Cumulatively Count Gaps in Sequential Numbers Results in Different Answers When New Data Added

1 Answers1