0

Suppose I am having the following data set

data
    Group       Date
    A        2016-03-10
    A        2016-03-11
    A        2016-03-12
    A        2016-04-13
    A        2016-04-14
    A        2016-05-15
    A        2016-05-16
    A        2016-05-17
    B        2016-02-11
    B        2016-02-12
    B        2016-02-13
    B        2016-02-19
    B        2016-03-15

I want to find the different date intervals for each group. For example, for Group A, 2016-03-10 to 2016-03-12 should be interval 1, 2016-04-13 to 2016-04-14 should be interval 2 and 2016-05-15 to 2016-05-17 should be interval 3. I want to find where all there have been breaks and how many breaks have occurred for each group. In this way I can analyse that. This should be computed for each group. The following should be my ideal output,

Group       Date         Interval
A        2016-03-10         1
A        2016-03-11         1
A        2016-03-12         1
A        2016-04-13         2
A        2016-04-14         2
A        2016-05-15         3
A        2016-05-16         3
A        2016-05-17         3
B        2016-02-11         1
B        2016-02-12         1
B        2016-02-13         1
B        2016-02-19         2
B        2016-03-15         3 

The following are my tryings,

data %>% group_by(Group) %>% mutate(Date - lag(Date)) . 

This gives my output of NAs for first row, 1 whenever the date changes and 0 when it doesnt change. But I want something like 1,2,3 for each date interval.

Updated dataset for which it is not working,

 group       date     count
(factor)    (date)
1 Albany 2016-02-15    55
2 Albany 2016-02-16     1
3 Albany 2016-04-08    40
haimen
  • 1,985
  • 7
  • 30
  • 53

2 Answers2

6

You can cumsum on the difference vector where when the difference is not 1 the value is specified to be TRUE:

df %>% 
       group_by(Group) %>% 
       mutate(Interval = cumsum(Date - lag(Date, default = first(Date)) != 1))

# Source: local data frame [13 x 3]
# Groups: Group [2]

#    Group       Date Interval
#   <fctr>     <date>    <int>
#1       A 2016-03-10        1
#2       A 2016-03-11        1
#3       A 2016-03-12        1
#4       A 2016-04-13        2
#5       A 2016-04-14        2
#6       A 2016-05-15        3
#7       A 2016-05-16        3
#8       A 2016-05-17        3
#9       B 2016-02-11        1
#10      B 2016-02-12        1
#11      B 2016-02-13        1
#12      B 2016-02-19        2
#13      B 2016-03-15        3

Data:

df = structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    Date = structure(c(16870, 16871, 16872, 16904, 16905, 16936, 
    16937, 16938, 16842, 16843, 16844, 16850, 16875), class = "Date")), .Names = c("Group", 
"Date"), row.names = c(NA, -13L), class = "data.frame")
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • I am getting an error of "Error in View : expecting a single value" for the command. It is ok until the date - lag(date) but after that, when we give that default = first(date) I am getting expecting single value.. Can you help ? – haimen Aug 11 '16 at 16:31
  • Not very sure. It's working for the example you gave. Is your `Date` column of class `Date`? Try `lapply(df, class)` to see what class you have for each column. – Psidom Aug 11 '16 at 16:37
  • It's really hard to guess what is happening without accessing to the real data. You can also try to detach and attach the `dplyr` package to see if that helps. – Psidom Aug 11 '16 at 16:57
  • I have updated a piece of original data for which it is not working. Can you check that? – haimen Aug 11 '16 at 16:57
  • Just checked. Worked as well. Which version of `dplyr` are you using? Try to upgrade `dplyr`, or try `default = date[1]` to replace the `first` function. – Psidom Aug 11 '16 at 17:01
  • just to make sure I am asking.. Is the closing brackets placing correct here? I feel there might be a mistake there because I am running the same code for same DF , but I am not getting the output – haimen Aug 11 '16 at 17:19
  • It is. I had no problem copying and pasting the code from here and get the same results. I also updated with the data I used to reproduce the result I have. – Psidom Aug 11 '16 at 17:25
  • I tried the data and code in two different systems. I am getting the same error. Unfortunately I am not able to reprise it many systems that I have tried here. – haimen Aug 11 '16 at 17:26
0

This is some sort of duplicate of this question here: Group rows in data frame based on time difference between consecutive rows

Basically you want to do these two operations:

df$gap <- c(0, (diff(df$date) > 1)*1) # identify gap between dates larger than 1
df$group <- cumsum(df$gap) + 1  # cumulative sum of 'gap' variable
Community
  • 1
  • 1
Maximilian Kohl
  • 632
  • 1
  • 9
  • 23