I'm looking for an efficient way to identify spells/runs in a time series. In the image below, the first three columns are what I have; the fourth column, spell, is what I'm trying to compute. I've tried using dplyr's lead and lag, but that gets too complicated. I've tried rle but got nowhere.

[Image: table with the columns time, group, is.5 and the desired spell column]

ReprEx

df <- structure(list(time = structure(c(1538876340, 1538876400, 
1538876460,1538876520, 1538876580, 1538876640, 1538876700, 1538876760, 1526824800, 
1526824860, 1526824920, 1526824980, 1526825040, 1526825100), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), group = c("A", "A", "A", "A", "A", "A", "A", "A", "B", 
"B", "B", "B", "B", "B"), is.5 = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1)), 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L))

I prefer a tidyverse solution.

Assumptions

  1. Data is sorted by group and then by time

  2. There are no gaps in time within each group
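
Both assumptions can be checked before computing spells. The following is only a sketch in base R on a small made-up data frame (`toy`, `is_sorted` and `no_gaps` are illustrative names, not part of the original data), and it treats "no gaps" as "a constant step between consecutive rows of a group":

```r
# Made-up data with the same `time` and `group` columns as `df`.
toy <- data.frame(
  time  = as.POSIXct("2018-10-07 01:39:00", tz = "UTC") + 60 * c(0:3, 0:2),
  group = c("A", "A", "A", "A", "B", "B", "B")
)

# Assumption 1: rows are ordered by group, then by time.
is_sorted <- identical(order(toy$group, toy$time), seq_len(nrow(toy)))

# Assumption 2: within each group, consecutive rows are always the same step apart.
steps <- tapply(as.numeric(toy$time), toy$group,
                function(t) unique(diff(t)), simplify = FALSE)
no_gaps <- all(lengths(steps) <= 1)

is_sorted
no_gaps
```

If either check fails, sorting with arrange(group, time) or handling the gaps explicitly would be needed before applying the answers below.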


Update

Thanks for the contributions. I've timed some of the proposed approaches on the full data (n = 2,583,360):

  1. the rle approach by @markus took 0.53 seconds
  2. the cumsum approach by @M-M took 2.85 seconds
  3. the function approach by @MrFlick took 0.66 seconds
  4. the rle and dense_rank approach by @tmfmnk took 0.89 seconds

I ended up choosing (1) by @markus because it's fast and still somewhat intuitive (subjective). (2) by @M-M best satisfied my desire for a dplyr solution, though it is computationally inefficient.
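
For anyone who wants to run a similar comparison, here is a rough sketch using system.time on simulated data. It is not the benchmark behind the numbers above: the data is random, both functions are base R stand-ins (`spell_rle` and `spell_cumsum` are illustrative names), and timings will differ by machine.

```r
set.seed(1)
n     <- 1e5
flag  <- rbinom(n, 1, 0.5)               # simulated 0/1 indicator
group <- rep(c("A", "B"), each = n / 2)  # two simulated groups

# rle-based numbering of runs of 1s
spell_rle <- function(x) {
  r <- rle(x)
  r$values <- cumsum(r$values) * r$values
  inverse.rle(r)
}

# cumsum-based numbering: count the rows where a run of 1s starts
spell_cumsum <- function(x) {
  starts <- x == 1 & c(0, head(x, -1)) != 1
  x * cumsum(starts)
}

# ave() applies each function within every group
t_rle    <- system.time(a <- ave(flag, group, FUN = spell_rle))
t_cumsum <- system.time(b <- ave(flag, group, FUN = spell_cumsum))

all(a == b)  # both approaches should agree
t_rle["elapsed"]
t_cumsum["elapsed"]
```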

Thomas Speidel

6 Answers

One option using rle

library(dplyr)
df %>% 
  group_by(group) %>% 
  mutate(
    spell = {
      r <- rle(is.5)
      r$values <- cumsum(r$values) * r$values
      inverse.rle(r) 
      }
  )
# A tibble: 14 x 4
# Groups:   group [2]
#   time                group  is.5 spell
#   <dttm>              <chr> <dbl> <dbl>
# 1 2018-10-07 01:39:00 A         0     0
# 2 2018-10-07 01:40:00 A         1     1
# 3 2018-10-07 01:41:00 A         1     1
# 4 2018-10-07 01:42:00 A         0     0
# 5 2018-10-07 01:43:00 A         1     2
# 6 2018-10-07 01:44:00 A         0     0
# 7 2018-10-07 01:45:00 A         0     0
# 8 2018-10-07 01:46:00 A         1     3
# 9 2018-05-20 14:00:00 B         0     0
#10 2018-05-20 14:01:00 B         0     0
#11 2018-05-20 14:02:00 B         1     1
#12 2018-05-20 14:03:00 B         1     1
#13 2018-05-20 14:04:00 B         0     0
#14 2018-05-20 14:05:00 B         1     2

You asked for a tidyverse solution, but if speed is your concern, you might use data.table. The syntax is very similar:

library(data.table)
setDT(df)[, spell := {
  r <- rle(is.5)
  r$values <- cumsum(r$values) * r$values
  inverse.rle(r) 
  }, by = group][] # the [] at the end prints the data.table

Explanation

When we call

r <- rle(df$is.5)

the result we get is

r
#Run Length Encoding
#  lengths: int [1:10] 1 2 1 1 2 1 2 2 1 1
#  values : num [1:10] 0 1 0 1 0 1 0 1 0 1

We need to replace values with the cumulative sum where values == 1 while values should remain zero otherwise.

We can achieve this by multiplying cumsum(r$values) by r$values, where the latter is a vector of 0s and 1s.

r$values <- cumsum(r$values) * r$values
r$values
# [1] 0 1 0 2 0 3 0 4 0 5

Finally we call inverse.rle to get back a vector of the same length as is.5.

inverse.rle(r)
# [1] 0 1 1 0 2 0 0 3 0 0 4 4 0 5

We do this for every group.
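
For readers who want to try the per-group step without dplyr or data.table, here is a minimal base R sketch using ave(). The helper name `spell_from_rle` is only illustrative, and the vectors are copied from the sample data in the question.

```r
is.5  <- c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1)
group <- rep(c("A", "B"), times = c(8, 6))

# Number the runs of 1s in a single 0/1 vector; runs of 0s stay 0.
spell_from_rle <- function(x) {
  r <- rle(x)
  r$values <- cumsum(r$values) * r$values
  inverse.rle(r)
}

# ave() applies the helper separately within each group.
spell <- ave(is.5, group, FUN = spell_from_rle)
spell
# [1] 0 1 1 0 2 0 0 3 0 0 1 1 0 2
```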

markus

Here's a helper function that can return what you are after

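spell_index <- function(time, flag) {
  # A spell starts where flag becomes 1 one time step after the previous row.
  # The first row of a group has no previous row: coalesce() and default = 0
  # let it start a spell instead of turning the whole result into NA.
  change <- dplyr::coalesce(time - dplyr::lag(time) == 1, TRUE) &
    flag == 1 & dplyr::lag(flag, default = 0) != 1
  cumsum(change) * (flag == 1) + 0
}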

And you can use it with your data like

library(dplyr)
df %>% 
  group_by(group) %>% 
  mutate(
    spell = spell_index(time, is.5)
  )

Basically, the helper function uses lag() to look for changes. We use cumsum() to count the changes seen so far. Then we multiply by a logical value to zero out the rows that are not part of a spell.
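
To make those three steps concrete, here is a small trace in base R on the is.5 values of group A. It is only a sketch: it leaves out the time check, and `prev` is a base R stand-in for a lagged copy of the flag.

```r
flag <- c(0, 1, 1, 0, 1, 0, 0, 1)   # is.5 for group A
prev <- c(0, head(flag, -1))        # previous value, with 0 before the first row

change <- flag == 1 & prev != 1     # TRUE on the first row of each spell
change
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE

counted <- cumsum(change)           # number of spells started so far
counted
# [1] 0 1 1 1 2 2 2 3

spell <- counted * (flag == 1)      # zero out rows outside a spell
spell
# [1] 0 1 1 0 2 0 0 3
```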

MrFlick

Here is one option with rleid from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)). Grouped by 'group', get the run-length id (rleid) of 'is.5' and multiply it by 'is.5', so that the ids belonging to runs of 0s become 0, and assign the result to 'spell'. Then use a logical vector in i to select the rows where 'spell' is not zero and, again by 'group', replace those values with their position among the unique 'spell' values (match). Grouping the second step as well keeps the numbering correct when a group starts with a 1.

library(data.table)
setDT(df)[, spell := rleid(is.5) * as.integer(is.5), by = group
       ][!!spell, spell := match(spell, unique(spell)), by = group][]
#                   time group is.5 spell
# 1: 2018-10-07 01:39:00     A    0     0
# 2: 2018-10-07 01:40:00     A    1     1
# 3: 2018-10-07 01:41:00     A    1     1
# 4: 2018-10-07 01:42:00     A    0     0
# 5: 2018-10-07 01:43:00     A    1     2
# 6: 2018-10-07 01:44:00     A    0     0
# 7: 2018-10-07 01:45:00     A    0     0
# 8: 2018-10-07 01:46:00     A    1     3
# 9: 2018-05-20 14:00:00     B    0     0
#10: 2018-05-20 14:01:00     B    0     0
#11: 2018-05-20 14:02:00     B    1     1
#12: 2018-05-20 14:03:00     B    1     1
#13: 2018-05-20 14:04:00     B    0     0
#14: 2018-05-20 14:05:00     B    1     2
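
The two steps can also be traced in base R on the is.5 values of group A. This is a sketch only: cumsum(c(TRUE, diff(x) != 0)) is used here as a stand-in for rleid on a single numeric vector, and `raw` and `nz` are illustrative names.

```r
x <- c(0, 1, 1, 0, 1, 0, 0, 1)             # is.5 for group A

run_id <- cumsum(c(TRUE, diff(x) != 0))    # a new id whenever the value changes
run_id
# [1] 1 2 2 3 4 5 5 6

raw <- run_id * x                          # ids of the 0-runs become 0
raw
# [1] 0 2 2 0 4 0 0 6

spell <- raw
nz <- raw != 0                             # rows that belong to a spell
spell[nz] <- match(raw[nz], unique(raw[nz]))  # renumber 2, 4, 6 as 1, 2, 3
spell
# [1] 0 1 1 0 2 0 0 3
```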

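Or, after the first step, use .GRP. Note that this numbers the distinct 'spell' values across the whole table, so it matches the per-group numbering only when every group starts with the same value of 'is.5', as in the sample data.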

df[!!spell, spell := .GRP, spell]
akrun

This works,

The data,

df <- structure(list(time = structure(c(1538876340, 1538876400, 1538876460,1538876520, 1538876580, 1538876640, 1538876700, 1538876760, 1526824800, 1526824860, 1526824920, 1526824980, 1526825040, 1526825100), class = c("POSIXct", "POSIXt"), tzone = "UTC"), group = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), is.5 = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L))

We split our data by group,

df2 <- split(df, df$group)

Build a function we can apply to the list,

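library(dplyr)

my_func <- function(dat){
  rst <- dat %>% 
    # +1 where is.5 switches from 0 to 1, -1 where it switches back
    mutate(change = diff(c(0,is.5))) %>% 
    # 1 only on the first row of each spell
    mutate(flag = change*abs(is.5)) %>% 
    # count the spell starts so far; rows outside a spell get 0
    mutate(spell = ifelse(is.5 == 0 | change == -1, 0, cumsum(flag))) %>% 
    dplyr::select(time, group, is.5, spell)
  return(rst)
}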

Then apply it,

l <- lapply(df2, my_func)

We can now turn this list back into a data frame:

do.call(rbind.data.frame, l)
Hector Haffenden

A somewhat different possibility (not involving cumsum()) could be:

library(dplyr)

df %>%
 group_by(group) %>%
 mutate(spell = with(rle(is.5), rep(seq_along(lengths), lengths))) %>%
 group_by(group, is.5) %>%
 mutate(spell = dense_rank(spell)) %>%
 ungroup() %>%
 mutate(spell = ifelse(is.5 == 0, 0, spell))

   time                group  is.5 spell
   <dttm>              <chr> <dbl> <dbl>
 1 2018-10-07 01:39:00 A         0     0
 2 2018-10-07 01:40:00 A         1     1
 3 2018-10-07 01:41:00 A         1     1
 4 2018-10-07 01:42:00 A         0     0
 5 2018-10-07 01:43:00 A         1     2
 6 2018-10-07 01:44:00 A         0     0
 7 2018-10-07 01:45:00 A         0     0
 8 2018-10-07 01:46:00 A         1     3
 9 2018-05-20 14:00:00 B         0     0
10 2018-05-20 14:01:00 B         0     0
11 2018-05-20 14:02:00 B         1     1
12 2018-05-20 14:03:00 B         1     1
13 2018-05-20 14:04:00 B         0     0
14 2018-05-20 14:05:00 B         1     2

First, it groups by "group" and gets the run-length ID of "is.5". Second, it groups by "group" and "is.5" and takes the dense rank of the run-length ID. Finally, it assigns 0 to rows where "is.5" == 0.
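
The ranking step can be followed in base R on the is.5 values of group A. This is only a sketch: `dense` is a hypothetical stand-in for dense_rank that assumes no missing values, and ave() plays the role of the second group_by.

```r
is.5 <- c(0, 1, 1, 0, 1, 0, 0, 1)   # is.5 for group A

# Run-length ID: one id per run, repeated over the length of the run.
run_id <- with(rle(is.5), rep(seq_along(lengths), lengths))
run_id
# [1] 1 2 2 3 4 5 5 6

# Dense rank without gaps (stand-in for dplyr::dense_rank, no NA handling).
dense <- function(v) match(v, sort(unique(v)))

# Rank the run ids separately for is.5 == 0 and is.5 == 1.
ranked <- ave(run_id, is.5, FUN = dense)
ranked
# [1] 1 1 1 2 2 3 3 3

spell <- ifelse(is.5 == 0, 0, ranked)
spell
# [1] 0 1 1 0 2 0 0 3
```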

tmfmnk

One option is to use cumsum:

library(dplyr)
df %>% group_by(group) %>%  arrange(group, time) %>% 
   mutate(spell = is.5 * cumsum( c(0,lag(is.5)[-1]) != is.5 & is.5!=0) )


# # A tibble: 14 x 4
# # Groups:   group [2]
#   time                  group     is.5   spell
#   <dttm>                <chr>     <dbl>  <dbl>
# 1 2018-10-07 01:39:00   A         0      0
# 2 2018-10-07 01:40:00   A         1      1
# 3 2018-10-07 01:41:00   A         1      1
# 4 2018-10-07 01:42:00   A         0      0
# 5 2018-10-07 01:43:00   A         1      2
# 6 2018-10-07 01:44:00   A         0      0
# 7 2018-10-07 01:45:00   A         0      0
# 8 2018-10-07 01:46:00   A         1      3
# 9 2018-05-20 14:00:00   B         0      0
# 10 2018-05-20 14:01:00  B         0      0
# 11 2018-05-20 14:02:00  B         1      1
# 12 2018-05-20 14:03:00  B         1      1
# 13 2018-05-20 14:04:00  B         0      0
# 14 2018-05-20 14:05:00  B         1      2

The condition c(0,lag(is.5)[-1]) != is.5 starts a new id (i.e. spell) whenever is.5 changes. We don't want to start a new id on rows where is.5 equals 0, which is why cumsum has the second condition (i.e. is.5!=0).

However, that second rule only prevents assigning a new id (adding 1 to the previous id) but it won't set the id to 0. That's why I have multiplied the answer by is.5.
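
Here is how the two conditions and the final multiplication play out on the is.5 values of group A. It is a base R sketch in which `prev` holds the same values as c(0, lag(is.5)[-1]); the intermediate names are only for illustration.

```r
is.5 <- c(0, 1, 1, 0, 1, 0, 0, 1)   # is.5 for group A
prev <- c(0, head(is.5, -1))        # previous value, 0 before the first row

changed <- prev != is.5             # first condition: the value changed
changed
# [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

starts <- changed & is.5 != 0       # second condition: keep changes into a 1
id <- cumsum(starts)
id
# [1] 0 1 1 1 2 2 2 3

spell <- is.5 * id                  # set the id to 0 where is.5 is 0
spell
# [1] 0 1 1 0 2 0 0 3
```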

M--