2

I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.

I want to group rows based on consecutive observations of the same value. For example, if the first four observations of type equal a, the first four observations should assigned to the same group. Order matters, so I can't dplyr::group_by and dplyr::summarize.

The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages, and not data.tables.

library(tidyverse)

# Crete some test data
df <- tibble(
  id = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
  val = runif(20)
)

df
#> # A tibble: 20 x 3
#>       id type     val
#>    <int> <chr>  <dbl>
#>  1     1 a     0.0606
#>  2     2 a     0.501 
#>  3     3 a     0.974 
#>  4     4 a     0.0833
#>  5     5 a     0.752 
#>  6     6 b     0.0450
#>  7     7 b     0.367 
#>  8     8 b     0.649 
#>  9     9 b     0.846 
#> 10    10 b     0.896 
#> 11    11 a     0.178 
#> 12    12 a     0.295 
#> 13    13 a     0.206 
#> 14    14 a     0.233 
#> 15    15 a     0.851 
#> 16    16 b     0.179 
#> 17    17 b     0.801 
#> 18    18 b     0.326 
#> 19    19 b     0.269 
#> 20    20 b     0.584

# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
  current <- df$type[i]
  lag <- ifelse(i == 1, NA, df$type[i - 1])
  lead <- ifelse(i == nrow(df), NA, df$type[i + 1])

  if (lead %>% is.na) {
    df$consec_group[i] <- ifelse(current == lag, count, count + 1) 
  } else {
    df$consec_group[i] <- count 
    if (current != lead) count <- count + 1
  }
}

df
#> # A tibble: 20 x 4
#>       id type     val consec_group
#>    <int> <chr>  <dbl>        <dbl>
#>  1     1 a     0.0606            1
#>  2     2 a     0.501             1
#>  3     3 a     0.974             1
#>  4     4 a     0.0833            1
#>  5     5 a     0.752             1
#>  6     6 b     0.0450            2
#>  7     7 b     0.367             2
#>  8     8 b     0.649             2
#>  9     9 b     0.846             2
#> 10    10 b     0.896             2
#> 11    11 a     0.178             3
#> 12    12 a     0.295             3
#> 13    13 a     0.206             3
#> 14    14 a     0.233             3
#> 15    15 a     0.851             3
#> 16    16 b     0.179             4
#> 17    17 b     0.801             4
#> 18    18 b     0.326             4
#> 19    19 b     0.269             4
#> 20    20 b     0.584             4

Created on 2019-03-14 by the reprex package (v0.2.1)

This grouping of consecutive type occurrences is really just an intermediate step. My endgame is manipulate val for a given consec_group, based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.

Cœur
  • 37,241
  • 25
  • 195
  • 267
djfinnoy
  • 585
  • 3
  • 13
  • 2
    `with(rle(df$type), rep(seq_along(lengths), lengths))` – d.b Mar 14 '19 at 18:01
  • Here's one post that this is close to a dupe of: https://stackoverflow.com/q/47169195/5325862 – camille Mar 14 '19 at 18:02
  • 1
    @camille That is pretty close, but the question and answers here are much clearer. Maybe close that question as a dupe of this one? – divibisan Mar 14 '19 at 19:41

2 Answers2

5

You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...

library(data.table)
setDT(df)[, groupid := rleid(type)][]

#     id type         val groupid
#  1:  1    a 0.624078793       1
#  2:  2    a 0.687361541       1
#  3:  3    a 0.817702740       1
#  4:  4    a 0.669857208       1
#  5:  5    a 0.100977936       1
#  6:  6    b 0.418275823       2
#  7:  7    b 0.660119857       2
#  8:  8    b 0.876015209       2
#  9:  9    b 0.473562143       2
# 10: 10    b 0.284474633       2
# 11: 11    a 0.034154862       3
# 12: 12    a 0.391760387       3
# 13: 13    a 0.383107868       3
# 14: 14    a 0.729583433       3
# 15: 15    a 0.006288375       3
# 16: 16    b 0.530179235       4
# 17: 17    b 0.802643704       4
# 18: 18    b 0.409618633       4
# 19: 19    b 0.309363642       4
# 20: 20    b 0.021918512       4

If you insist on using the tidyverse/dplyr, you can (of course) still use the rleid-function as follows:

df %>% mutate( groupid = data.table::rleid(type) )

benchmarks

on a larger sample

library(tidyverse)
library(data.table)

# Crete some large test data
df <- tibble(
  id = 1:200000,
  type = sample(letters[1:26], 200000, replace = TRUE),
  val = runif(200000)
)

dt <- as.data.table(df)

microbenchmark::microbenchmark(
  dplyr.rleid      = df %>% mutate( groupid = data.table::rleid(type) ),
  data.table.rleid = dt[, groupid := rleid(type)][], 
  rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
  rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
  transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
  times = 10)

# Unit: milliseconds
#             expr       min        lq      mean    median        uq        max neval
#      dplyr.rleid  3.153626  3.278049  3.410363  3.444949  3.502792   3.582626    10
# data.table.rleid  2.965639  3.065959  3.173992  3.145643  3.259672   3.507009    10
#              rle 13.059774 14.042797 24.364176 26.126176 29.460561  36.874054    10
#             rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719    10
#        transform 12.330717 22.419128 22.725242 25.532084 26.187634  26.702794    10
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • You make a persuasive argument. My reasoning was: a) I haven't invested time in learning data.tables yet, because I naively thought the tidyverse covered my needs (you have might just convinced me otherwise). b) im doing exploratory analysis, and a colleague will inevitably be translating my work into python one day; I wanted to keep things easy to understand, and my impression is that DT is harder to read than dplyr. – djfinnoy Mar 14 '19 at 18:15
  • Just add enough comments (as you would (likely) also do when using the tidyverse ). – Wimpel Mar 14 '19 at 18:38
  • 1
    Thanks for posting this: even if it's not what the solution the asker prefers, it will be helpful for future people with the same question! – divibisan Mar 14 '19 at 18:42
  • 2
    @djfinnoy see bottom of answer for a sort-of-dplyr solution ;-) – Wimpel Mar 14 '19 at 18:50
  • Thanks! although I said no data.tables, i just love the elegance; i'll use the extra space for comments – djfinnoy Mar 14 '19 at 18:53
  • It's interesting to see that `rleid()` in `dplyr` is almost as fast (or should we say as slow?) as the solutions based on `rle()`. – tmfmnk Mar 14 '19 at 18:59
  • If you get the chance, also test `transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths)))` – d.b Mar 14 '19 at 19:43
  • @tmfmnk probabaly because of the small sample set.. see updates benchmarks with a df of 200,000 rows. – Wimpel Mar 15 '19 at 06:59
  • @d.b updated the benchmarks with `transform`-solution – Wimpel Mar 15 '19 at 07:00
  • This is what I would expect. Thanks for the benchmarking! – tmfmnk Mar 15 '19 at 07:02
3

You can use a rleid()-like possibility like this:

df %>%
 mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})

      id type     val ID_rleid
   <int> <chr>  <dbl>    <int>
 1     1 a     0.0430        1
 2     2 a     0.858         1
 3     3 a     0.504         1
 4     4 a     0.318         1
 5     5 a     0.469         1
 6     6 b     0.144         2
 7     7 b     0.173         2
 8     8 b     0.0706        2
 9     9 b     0.958         2
10    10 b     0.557         2
11    11 a     0.358         3
12    12 a     0.973         3
13    13 a     0.982         3
14    14 a     0.177         3
15    15 a     0.599         3
16    16 b     0.627         4
17    17 b     0.454         4
18    18 b     0.682         4
19    19 b     0.690         4
20    20 b     0.713         4

Or a modification (originally proposed by @d.b) that makes it more handy:

df %>%
 mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • @divibisan I haven't seen that comment before posting my own solution, but it is really nice. I added it to my post. – tmfmnk Mar 14 '19 at 18:47