R: Grouping discrete time series by years

Question

I'm working on summer time series of drought period data and trying to identify individual drought periods. My problem is that the code I'm currently using does not recognize when the year changes, so it assigns the same ID to the end of one summer and the beginning of the next.

Here's a simplified version of the data I have.

library(dplyr)

myData <- tibble(series = rep("FS",21),
                 date = c("2016-10-26","2016-10-27","2016-10-28","2016-10-29","2016-10-30","2016-10-31","2017-05-01","2017-05-02","2017-05-03","2017-05-04","2017-05-05","2017-05-06","2017-05-07","2017-05-08","2017-05-09","2017-05-10","2017-05-11","2017-05-12","2017-05-13","2017-05-14","2017-05-15"),
                 year = c(rep(2016,6),rep(2017,15)),
                 day_status = c(rep("normal",3),rep("drought",16),rep("normal",2)))

> myData
# A tibble: 21 x 4
   series date        year day_status
   <chr>  <chr>      <dbl> <chr>     
 1 FS     2016-10-26  2016 normal    
 2 FS     2016-10-27  2016 normal    
 3 FS     2016-10-28  2016 normal    
 4 FS     2016-10-29  2016 drought   
 5 FS     2016-10-30  2016 drought   
 6 FS     2016-10-31  2016 drought   
 7 FS     2017-05-01  2017 drought   
 8 FS     2017-05-02  2017 drought   
 9 FS     2017-05-03  2017 drought   
10 FS     2017-05-04  2017 drought   
# ... with 11 more rows

The result I'm looking for is something like this

> myData2
# A tibble: 21 x 5
   series date        year day_status group
   <chr>  <chr>      <dbl> <chr>      <dbl>
 1 FS     2016-10-26  2016 normal         1
 2 FS     2016-10-27  2016 normal         1
 3 FS     2016-10-28  2016 normal         1
 4 FS     2016-10-29  2016 drought        2
 5 FS     2016-10-30  2016 drought        2
 6 FS     2016-10-31  2016 drought        2
 7 FS     2017-05-01  2017 drought        3
 8 FS     2017-05-02  2017 drought        3
 9 FS     2017-05-03  2017 drought        3
10 FS     2017-05-04  2017 drought        3
# ... with 11 more rows

The code I have been using is

myData$group <- with(myData, rep(seq_along(z<-rle(myData$day_status)$lengths),z))

but it treats the droughts from October and May as the same drought, which is not the case.

I then tried to use dplyr and group_by to run the function for one year at a time:

myData %>%
  group_by(year) %>%
  mutate(group = rep(seq_along(z<-rle(myData$day_status)$lengths),z)) %>%
  ungroup() %>%
  {. ->> myData}

but this gives the error

Error: Column group must be length 6 (the group size) or one, not 21.

I gathered this has something to do with how group_by works, but I don't fully understand what the problem is. Any help is greatly appreciated!

score 0 · Accepted Answer · answered Aug 04 '20 at 08:05

For such cases I make use of rle:

library(dplyr)

# Run lengths of consecutive identical (year, day_status) combinations
rleLengths <- rle(paste0(myData$year, myData$day_status))$lengths

myData <- myData %>%
  mutate(group = rep(seq_along(rleLengths), rleLengths))

myData$group

[1] 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4
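As for the error in the question: inside a grouped mutate(), a bare column name refers only to the current group's rows, whereas myData$day_status always returns the full 21-value column, which explains the length mismatch. The following is a minimal sketch of a per-year variant under that reading. The helper run_id and the column run_in_year are hypothetical names introduced here for illustration, and the date column is left out for brevity.

```r
library(dplyr)

# Simplified copy of the question's data (date column omitted).
myData <- tibble(
  series = rep("FS", 21),
  year = c(rep(2016, 6), rep(2017, 15)),
  day_status = c(rep("normal", 3), rep("drought", 16), rep("normal", 2))
)

# Hypothetical helper: number the consecutive runs of equal values in x.
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

grouped <- myData %>%
  group_by(year) %>%
  # Bare `day_status` is the current year's slice (6 or 15 values here);
  # `myData$day_status` would always be the full 21-value column.
  mutate(run_in_year = run_id(day_status)) %>%
  ungroup() %>%
  # Turn each (year, run_in_year) pair into one running id,
  # numbered in order of first appearance.
  mutate(group = match(paste(year, run_in_year),
                       unique(paste(year, run_in_year))))

grouped$group
```

This should yield the same ids as the rle approach above; the single rle call on the pasted key is shorter, while the grouped version keeps the per-year run number available as a separate column.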

answered Aug 04 '20 at 08:05

Martin Schmelzer

Thank you! This worked perfectly. Just a question, why does rleLengths need to be defined separately? – LaHN Aug 04 '20 at 09:38
It does not have to be defined separately. You could call rle again each time it is needed within the mutate call, but that would be redundant. – Martin Schmelzer Aug 04 '20 at 10:54

score 0 · Answer 2 · answered Aug 04 '20 at 08:05

You can use cur_group_id(), available since dplyr 1.0.0:

library(dplyr)
myData %>% group_by(year, day_status) %>% mutate(group = cur_group_id())

If you want the groups to be numbered in the order in which they appear, a base R option is:

myData <- transform(myData, group = paste0(year, day_status))
transform(myData, group = match(group, unique(group)))


#   series       date year day_status group
#1      FS 2016-10-26 2016     normal     1
#2      FS 2016-10-27 2016     normal     1
#3      FS 2016-10-28 2016     normal     1
#4      FS 2016-10-29 2016    drought     2
#5      FS 2016-10-30 2016    drought     2
#6      FS 2016-10-31 2016    drought     2
#7      FS 2017-05-01 2017    drought     3
#8      FS 2017-05-02 2017    drought     3
#9      FS 2017-05-03 2017    drought     3
#10     FS 2017-05-04 2017    drought     3
#11     FS 2017-05-05 2017    drought     3
#12     FS 2017-05-06 2017    drought     3
#13     FS 2017-05-07 2017    drought     3
#14     FS 2017-05-08 2017    drought     3
#15     FS 2017-05-09 2017    drought     3
#16     FS 2017-05-10 2017    drought     3
#17     FS 2017-05-11 2017    drought     3
#18     FS 2017-05-12 2017    drought     3
#19     FS 2017-05-13 2017    drought     3
#20     FS 2017-05-14 2017     normal     4
#21     FS 2017-05-15 2017     normal     4
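One caveat, assuming a year can contain more than one drought: labelling by unique (year, day_status) pairs gives two separate droughts in the same year the same id. A rough base R sketch of the difference, using a hypothetical toy data frame (toy) that is not part of the question's data:

```r
# Hypothetical toy data: two separate droughts within the same year.
toy <- data.frame(
  year = rep(2017, 6),
  day_status = c("drought", "drought", "normal", "normal", "drought", "drought")
)

key <- paste0(toy$year, toy$day_status)

# Labelling by unique (year, day_status) pairs merges the two droughts.
by_pair <- match(key, unique(key))

# Starting a new group whenever the key differs from the previous row
# keeps them apart.
n <- nrow(toy)
is_new_run <- c(TRUE, key[-1] != key[-n])
by_run <- cumsum(is_new_run)

by_pair  # 1 1 2 2 1 1
by_run   # 1 1 2 2 3 3
```

For the simplified data in the question both labellings coincide, because each (year, day_status) pair occurs in only one run.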

Hmm, I tried using cur_group_id() but I don't think I did it correctly. It always gave the same number to all the normal periods and all the drought periods within the same year. – LaHN Aug 04 '20 at 09:41
That's weird. It gives me 4 groups for your data. Did you do `group_by` first? Did the base R option work? – Ronak Shah Aug 04 '20 at 09:44
I tried using it with my whole data, so maybe that causes the problem. As I have several series, I used group_by(series, year, day_status). But even with the simplified data the group order is a little odd for me: it's 2,2,2,1,1,1,3,3,3. The base R option worked perfectly! – LaHN Aug 04 '20 at 10:06
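The order 2,2,2,1,1,1,3,3,3 reported in the last comment is consistent with cur_group_id() numbering groups by their sorted keys rather than by first appearance: within 2016, "drought" sorts before "normal". A small sketch on the simplified data (date column omitted), with a relabelling step that is an illustration rather than part of the answer:

```r
library(dplyr)

myData <- tibble(
  series = rep("FS", 21),
  year = c(rep(2016, 6), rep(2017, 15)),
  day_status = c(rep("normal", 3), rep("drought", 16), rep("normal", 2))
)

ids <- myData %>%
  group_by(year, day_status) %>%
  mutate(group = cur_group_id()) %>%
  pull(group)

# Ids follow the sorted group keys, so 2016/"drought" is 1
# and 2016/"normal" is 2.
head(ids, 9)

# Relabel the ids in order of first appearance.
relabelled <- match(ids, unique(ids))
relabelled
```

With several series, the same relabelling could be done per series after grouping by series, though that has not been checked against the full data.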
