1

I'm working on summer time series of drought period data and trying to identify individual periods. My problem is that the code I'm currently using does not recognize when the year changes, so it assigns the same id to the end of one summer and the beginning of the next.

Here's a simplified version of the data I have.

library(dplyr)  # re-exports tibble() and %>%

myData <- tibble(series = rep("FS",21),
                 date = c("2016-10-26","2016-10-27","2016-10-28","2016-10-29","2016-10-30","2016-10-31","2017-05-01","2017-05-02","2017-05-03","2017-05-04","2017-05-05","2017-05-06","2017-05-07","2017-05-08","2017-05-09","2017-05-10","2017-05-11","2017-05-12","2017-05-13","2017-05-14","2017-05-15"),
                 year = c(rep(2016,6),rep(2017,15)),
                 day_status = c(rep("normal",3),rep("drought",16),rep("normal",2)))

> myData
# A tibble: 21 x 4
   series date        year day_status
   <chr>  <chr>      <dbl> <chr>     
 1 FS     2016-10-26  2016 normal    
 2 FS     2016-10-27  2016 normal    
 3 FS     2016-10-28  2016 normal    
 4 FS     2016-10-29  2016 drought   
 5 FS     2016-10-30  2016 drought   
 6 FS     2016-10-31  2016 drought   
 7 FS     2017-05-01  2017 drought   
 8 FS     2017-05-02  2017 drought   
 9 FS     2017-05-03  2017 drought   
10 FS     2017-05-04  2017 drought   
# ... with 11 more rows

The result I'm looking for is something like this

> myData2
# A tibble: 21 x 5
   series date        year day_status group
   <chr>  <chr>      <dbl> <chr>      <dbl>
 1 FS     2016-10-26  2016 normal         1
 2 FS     2016-10-27  2016 normal         1
 3 FS     2016-10-28  2016 normal         1
 4 FS     2016-10-29  2016 drought        2
 5 FS     2016-10-30  2016 drought        2
 6 FS     2016-10-31  2016 drought        2
 7 FS     2017-05-01  2017 drought        3
 8 FS     2017-05-02  2017 drought        3
 9 FS     2017-05-03  2017 drought        3
10 FS     2017-05-04  2017 drought        3
# ... with 11 more rows

The code I have been using is myData$group <- with(myData, rep(seq_along(z<-rle(myData$day_status)$lengths),z)), but it treats the droughts from October and May as a single drought, which is not the case.

I then tried to use dplyr and group_by to run the function one year at a time

myData %>%
  group_by(year) %>%
  mutate(group = rep(seq_along(z<-rle(myData$day_status)$lengths),z)) %>%
  ungroup() %>%
  {. ->> myData}

but this gives the error: Error: Column group must be length 6 (the group size) or one, not 21. I gather this has something to do with how group_by works, but I don't fully understand what the problem is. Any help is greatly appreciated!

LaHN

2 Answers

0

For such cases I make use of rle:

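library(dplyr)

# Run lengths of the combined year/status key, so a run also ends when the year changes
rleLengths <- rle(paste0(myData$year, myData$day_status))$lengths

# Repeat each run index by its run length so every row gets the id of its run
myData <- myData %>%
  mutate(group = rep(seq_along(rleLengths), rleLengths))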

myData$group

[1] 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4
Martin Schmelzer
  • Thank you! This worked perfectly. Just a question, why does rleLengths need to be defined separately? – LaHN Aug 04 '20 at 09:38
  • It does not have to be defined separately. You could call it again and again within the mutation, but that would be redundant. – Martin Schmelzer Aug 04 '20 at 10:54
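To illustrate the point about not needing a separate variable, here is a minimal base R sketch. The helper name run_id is hypothetical (it is not part of base R or dplyr), and the data frame only reproduces the year and day_status columns from the question.

```r
# Hypothetical helper (not a library function): number consecutive runs 1, 2, 3, ...
run_id <- function(x) {
  runs <- rle(as.character(x))
  rep(seq_along(runs$lengths), runs$lengths)
}

# Same year/day_status layout as the question's data, as a base data frame
myData <- data.frame(
  year = c(rep(2016, 6), rep(2017, 15)),
  day_status = c(rep("normal", 3), rep("drought", 16), rep("normal", 2))
)

# A run ends whenever either the year or the status changes
myData$group <- run_id(paste(myData$year, myData$day_status))
print(myData$group)
```

With such a helper, the dplyr form would presumably be mutate(group = run_id(paste(year, day_status))). Note the bare column names: inside a grouped mutate, myData$day_status still refers to all 21 rows rather than the rows of the current group, which appears to be what triggered the length error in the question.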
0

You can use cur_group_id, available from dplyr 1.0.0:

library(dplyr)
myData %>% group_by(year, day_status) %>% mutate(group = cur_group_id()) 

If you want the groups numbered in the order they appear, a base R option is:

myData <- transform(myData, group = paste0(year, day_status))
transform(myData, group = match(group, unique(group)))


#   series       date year day_status group
#1      FS 2016-10-26 2016     normal     1
#2      FS 2016-10-27 2016     normal     1
#3      FS 2016-10-28 2016     normal     1
#4      FS 2016-10-29 2016    drought     2
#5      FS 2016-10-30 2016    drought     2
#6      FS 2016-10-31 2016    drought     2
#7      FS 2017-05-01 2017    drought     3
#8      FS 2017-05-02 2017    drought     3
#9      FS 2017-05-03 2017    drought     3
#10     FS 2017-05-04 2017    drought     3
#11     FS 2017-05-05 2017    drought     3
#12     FS 2017-05-06 2017    drought     3
#13     FS 2017-05-07 2017    drought     3
#14     FS 2017-05-08 2017    drought     3
#15     FS 2017-05-09 2017    drought     3
#16     FS 2017-05-10 2017    drought     3
#17     FS 2017-05-11 2017    drought     3
#18     FS 2017-05-12 2017    drought     3
#19     FS 2017-05-13 2017    drought     3
#20     FS 2017-05-14 2017     normal     4
#21     FS 2017-05-15 2017     normal     4
Ronak Shah
  • Hmmm, I tried using the cur_group_id() but I think I did not do it correctly. It always gave the same number for all the normal periods and drought periods during the same year. – LaHN Aug 04 '20 at 09:41
  • That's weird. It gives me 4 groups for your data. Did you do `group_by` first? Did the base R option work? – Ronak Shah Aug 04 '20 at 09:44
  • I tried using it with my whole data, so maybe that causes the problem. As I have several series, I used group_by(series, year, day_status). But even with the simplified data the group order is a little funny for me: it's 2,2,2,1,1,1,3,3,3. The base R option worked perfectly! – LaHN Aug 04 '20 at 10:06
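The 2,2,2,1,1,1 ordering is most likely because group_by sorts the group keys, so "drought" is numbered before "normal" within each year. A related caveat, sketched below under assumed toy data (the five-row vectors are made up for illustration, not taken from the question): ids built from the year/status key alone give two separate droughts in the same year the same id, whereas a run-based id starts a new number at every change.

```r
# Made-up toy data: 2017 contains two droughts separated by a normal day
year   <- c(2016, 2016, 2017, 2017, 2017)
status <- c("normal", "drought", "drought", "normal", "drought")
key    <- paste(year, status)

# First-appearance ids (the match/unique approach): repeated keys share an id
id_by_key <- match(key, unique(key))

# Run ids: start a new id whenever the key differs from the previous row
id_by_run <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

print(id_by_key)  # 1 2 3 4 3
print(id_by_run)  # 1 2 3 4 5
```

For the data in the question both approaches give the same result, since each year has at most one run per status. With several series, the key would presumably need to include series as well.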