0

I have a tricky problem I'm trying to solve:

I have data that looks like the following sample:

UniqueID  Month  
ABC123    1       
ABC123    2      
ABC123    3      
ABC123    4      
ABC123    6      
ABC123    7      
DEF456    3      
DEF456    4      
DEF456    10     
DEF456    11     
DEF456    12     
DEF456    14     
GHI789    2      
GHI789    3  
JKL012    12     
JKL012    13     
JKL012    14         

The UniqueID is unique per month. The month column refers to a particular month. For example: 1=October 2018, 2=November 2019, and so on. We have a total of 14 different months for which we have data. I want to cumulatively count the number of times we skip a month and when the final month per UniqueID is not 14. The starting month does not factor into the calculation. The resulting calculation would result in the following sample:

UniqueID  Month  CountSkip
ABC123    1      0  
ABC123    2      0
ABC123    3      0
ABC123    4      0
ABC123    6      1
ABC123    7      2
DEF456    3      0
DEF456    4      0
DEF456    10     1
DEF456    11     1
DEF456    12     1
DEF456    14     2
GHI789    2      0
GHI789    3      1
JKL012    12     0
JKL012    13     0
JKL012    14     0

I have a snippet to calculate the total number of skips by doing the following:

data %>% 
  group_by(UniqueID) %>%
  mutate(Skipped = sum(diff(Month) > 1))

How could I modify this to cumulatively count the skips and also account for the last month value not being 14?

Any help would be appreciated! Thank you!

user2813606
  • 797
  • 2
  • 13
  • 37

2 Answers2

2

Here is one approach. Let me know if this has the logic you are looking for.

library(tidyverse)

data %>%
  group_by(UniqueID) %>%
  mutate(Skip = if_else(Month - lag(Month, default = first(Month) - 1) - 1 > 0 | 
                          (Month == last(Month) & Month != 14), 1, 0),
         CountSkip = cumsum(Skip))

# A tibble: 17 x 4
# Groups:   UniqueID, CountSkip [9]
   UniqueID Month  Skip CountSkip
   <chr>    <int> <dbl>     <dbl>
 1 ABC123       1     0         0
 2 ABC123       2     0         0
 3 ABC123       3     0         0
 4 ABC123       4     0         0
 5 ABC123       6     1         1
 6 ABC123       7     1         2
 7 DEF456       3     0         0
 8 DEF456       4     0         0
 9 DEF456      10     1         1
10 DEF456      11     0         1
11 DEF456      12     0         1
12 DEF456      14     1         2
13 GHI789       2     0         0
14 GHI789       3     1         1
15 JKL012      12     0         0
16 JKL012      13     0         0
17 JKL012      14     0         0

Data (from @akrun)

data <- structure(list(UniqueID = c("ABC123", "ABC123", "ABC123", "ABC123", 
                                    "ABC123", "ABC123", "DEF456", "DEF456", "DEF456", "DEF456", "DEF456", 
                                    "DEF456", "GHI789", "GHI789", "JKL012", "JKL012", "JKL012"), 
                       Month = c(1L, 2L, 3L, 4L, 6L, 7L, 3L, 4L, 10L, 11L, 12L, 
                                 14L, 2L, 3L, 12L, 13L, 14L)), class = "data.frame", row.names = c(NA, 
                                                                                                   -17L))
Ben
  • 28,684
  • 5
  • 23
  • 45
  • 2
    Not clear why you want the last `group_by` `data %>% group_by(UniqueID) %>% mutate(Skip = cumsum(c(FALSE, diff(Month) > 1) |(Month == last(Month) & Month != 14)))` – akrun Jan 28 '20 at 23:10
  • 2
    My mistake - thanks for correcting me! I'll edit the answer. I forgot I already had grouped, and just need to mutate at the end. – Ben Jan 29 '20 at 00:10
1

We can use cumsum directly on the diff logical vector

library(dplyr)
data %>% 
   group_by(UniqueID) %>%
   mutate(i1 = c(FALSE, diff(Month) > 1)) %>%
   group_by(grp = cumsum(c(TRUE, i1[-1])), add = TRUE) %>%
   mutate(Count = row_number() * any(i1) ) %>%
   ungroup %>%
   select(-i1, -grp)
# A tibble: 17 x 3
#   UniqueID Month Count
#   <chr>    <int> <int>
# 1 ABC123       1     0
# 2 ABC123       2     0
# 3 ABC123       3     0
# 4 ABC123       4     0
# 5 ABC123       6     1
# 6 ABC123       7     2
# 7 DEF456       3     0
# 8 DEF456       4     0
# 9 DEF456      10     1
#10 DEF456      11     2
#11 DEF456      12     3
#12 DEF456      14     1
#13 GHI789       2     0
#14 GHI789       3     0
#15 JKL012      12     0
#16 JKL012      13     0
#17 JKL012      14     0

data

data <- structure(list(UniqueID = c("ABC123", "ABC123", "ABC123", "ABC123", 
"ABC123", "ABC123", "DEF456", "DEF456", "DEF456", "DEF456", "DEF456", 
"DEF456", "GHI789", "GHI789", "JKL012", "JKL012", "JKL012"), 
    Month = c(1L, 2L, 3L, 4L, 6L, 7L, 3L, 4L, 10L, 11L, 12L, 
    14L, 2L, 3L, 12L, 13L, 14L)), class = "data.frame", row.names = c(NA, 
-17L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hey akrun, thanks for answering my question! I used your code and got zeros for the first occurrence of each UniqueID, but all the other values were NA. Also, I got 4 lines of warning messages --> "In Ops.factor(diff(Month),1) : '>' not meaningful for factors – user2813606 Jan 28 '20 at 19:36
  • 1
    @user2813606 You should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Use `dput(head(data,17))` instead of copy/pasting. – M-- Jan 28 '20 at 19:37
  • I also noticed the counts are not quite right: Rows 10, 11, 12, and 14 are not correct – user2813606 Jan 28 '20 at 19:43
  • @user2813606 If you check `6 and 7`, i used the same approach – akrun Jan 28 '20 at 19:44
  • Row 10-12 should be 1, 1, and then 2, and then Row 14 should be 1 since it is the last value and its value does not equal 14 – user2813606 Jan 28 '20 at 19:47
  • @user2813606. I find the logic of 6 and 7 and 10:12 the same – akrun Jan 28 '20 at 19:48