How to create group indices for nested groups in R

Question

I have a dataset with multiple observations nested within individuals. This example dataset includes columns for id and for day of the week (dayweek, 1-7). I have observations from 3 days from each individual. So one individual might have only submitted reports for Sun/Wed/Thu (1, 4, 5), and the other might have submitted reports for Sun/Mon/Tue (1, 2, 3), as in this example:

df <- data.frame(
  id = c(rep(1:2, each = 6),2),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3)
)

I want to set up a column that marks each individual's first, second, and third day, like this:

df2 <- data.frame(
  id = c(rep(1:2, each = 6),2),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3),
  daynum = c(rep(1:3, each = 2, times = 2), 3)
)

I tried using

df %>% group_indices(id, dayweek)

but this produces a new id for each individual-day combination. What's a good way to do this?

Thanks in advance!

What if the days were Friday, Saturday, Sunday: then you’d have 6,7,1 but the 6 would be the first day and the 1 would be the third, right? Are the rows already in date order, so that the first row for an id would get daynum = 1? — divibisan, Apr 21 '19 at 03:50
Hi, correct: in this case I would like to recode 6 to 1, 7 to 2, and 1 to 3. Yes, the rows are in order but there is a different number of observations for each individual and each day. So they might have 4 observations for Sat, 2 for Sun, etc. — Aziggy, Apr 21 '19 at 03:52

cropgen · Accepted Answer · 2019-04-21T04:11:50.537

`dplyr`

Using cumsum and !duplicated with dplyr

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(daynum = cumsum(!duplicated(dayweek)))


# A tibble: 13 x 3
# Groups:   id [2]
      id dayweek daynum
   <dbl>   <dbl>  <int>
 1     1       1      1
 2     1       1      1
 3     1       4      2
 4     1       4      2
 5     1       5      3
 6     1       5      3
 7     2       1      1
 8     2       1      1
 9     2       2      2
10     2       2      2
11     2       3      3
12     2       3      3
13     2       3      3

`tapply` from base `R`

unlist(tapply(df$dayweek, df$id, function(x) cumsum(!duplicated(x))))

 1  1  2  2  3  3  1  1  2  2  3  3  3

This handles also the "Friday, Saturday, Sunday" case (`dayweek` 6, 7, 1) well. — Uwe, Apr 21 '19 at 07:42
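To check the "Friday, Saturday, Sunday" case from the comments, here is a minimal base R sketch. The data frame `fss` is hypothetical (one individual, uneven numbers of rows per day), and `ave()` stands in for the grouped `mutate()` so that no packages are needed.

```r
# Hypothetical data: one individual reporting on Fri, Sat, Sun (dayweek 6, 7, 1),
# with an uneven number of rows per day.
fss <- data.frame(
  id      = rep(3, 7),
  dayweek = c(6, 6, 6, 6, 7, 7, 1)
)

# ave() applies the function within each id and returns values in row order.
# cumsum(!duplicated(x)) increases by 1 each time a new dayweek value appears.
fss$daynum <- ave(fss$dayweek, fss$id,
                  FUN = function(x) cumsum(!duplicated(x)))

fss$daynum
# [1] 1 1 1 1 2 2 3
```

Friday (6) is numbered 1 because it appears first, not because of its value.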

Ronak Shah · Answer 2 · 2019-04-21T08:11:26.687


We could group_by id and create a unique id for each dayweek

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(daynum = as.integer(factor(dayweek, levels = unique(dayweek))))

#      id dayweek daynum
#   <dbl>   <dbl>  <int>
# 1     1       1      1
# 2     1       1      1
# 3     1       4      2
# 4     1       4      2
# 5     1       5      3
# 6     1       5      3
# 7     2       1      1
# 8     2       1      1
# 9     2       2      2
#10     2       2      2
#11     2       3      3
#12     2       3      3
#13     2       3      3

In base R, we can use ave to do the same:

with(df, ave(dayweek, id, FUN = function(x) 
         as.integer(factor(x, levels = unique(x)))))
#[1] 1 1 2 2 3 3 1 1 2 2 3 3 3

edited Apr 21 '19 at 08:11

answered Apr 21 '19 at 04:00


For the "Friday, Saturday, Sunday" case (`dayweek` 6, 7, 1), this will return 2, 3, 1 while the OP expects 1, 2, 3 according to the comments. – Uwe Apr 21 '19 at 07:40
@Uwe Thanks, updated the answer to handle that case. – Ronak Shah Apr 21 '19 at 08:11
Interesting to see the use of `unique`. Somehow the `!duplicated` can easily be taken to mean `unique` but applying that becomes tricky. – NelsonGon Jun 21 '19 at 12:27
@NelsonGon Yes, here the OP wanted to follow the correct order based on when the `dayweek` was observed hence, the use of `unique`. – Ronak Shah Jun 21 '19 at 12:33
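The difference the comments describe can be seen in a small base R sketch. The vector `x` is a hypothetical Fri, Fri, Sat, Sun sequence; without an explicit `levels` argument, `factor()` sorts the unique values, which is what yields 2, 3, 1 instead of 1, 2, 3.

```r
x <- c(6, 6, 7, 1)  # hypothetical Fri, Fri, Sat, Sun

# Default levels are the sorted unique values (1, 6, 7), so Sunday gets code 1.
sorted_codes <- as.integer(factor(x))

# Levels in order of first appearance (6, 7, 1), so Friday gets code 1.
ordered_codes <- as.integer(factor(x, levels = unique(x)))

# match() against unique() gives the same first-appearance codes without a factor.
matched_codes <- match(x, unique(x))

sorted_codes   # [1] 2 2 3 1
ordered_codes  # [1] 1 1 2 3
matched_codes  # [1] 1 1 2 3
```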

score 3 · Answer 3 · edited Jun 20 '20 at 09:12

According to OP's comment, the rows are in order.

Given that, here are two approaches that also handle the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1) mentioned in the comments:

rleid()
fct_inorder()

`rleid()`

This uses the rleid() function from the data.table package:

library(dplyr)
df2 %>% 
  group_by(id) %>% 
  mutate(daynum2 = data.table::rleid(dayweek))

      id dayweek daynum daynum2
   <dbl>   <dbl>  <dbl>   <int>
 1     1       1      1       1
 2     1       1      1       1
 3     1       4      2       2
 4     1       4      2       2
 5     1       5      3       3
 6     1       5      3       3
 7     2       1      1       1
 8     2       1      1       1
 9     2       2      2       2
10     2       2      2       2
11     2       3      3       3
12     2       3      3       3
13     2       3      3       3
14     3       6      1       1
15     3       7      2       2
16     3       1      3       3

Note that an extended data set is used which also covers the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1).
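The approaches in this thread agree when all rows for a given day are contiguous, as the OP confirmed. The sketch below shows how they could diverge otherwise, using a hypothetical sequence in which a day recurs. `rle()` is used here as a base R stand-in for `data.table::rleid()` so that the example runs without extra packages.

```r
# Hypothetical sequence in which day 1 recurs after day 4.
x <- c(1, 1, 4, 4, 1)

# Base R stand-in for data.table::rleid(): number each run of identical values.
runs   <- rle(x)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# First-appearance index, as in factor(x, levels = unique(x)).
first_seen <- match(x, unique(x))

# Running count of distinct values seen so far, as in cumsum(!duplicated(x)).
distinct_so_far <- cumsum(!duplicated(x))

run_id           # [1] 1 1 2 2 3
first_seen       # [1] 1 1 2 2 1
distinct_so_far  # [1] 1 1 2 2 2
```

Which result is wanted depends on whether a recurring `dayweek` value should count as a new day or as the same day.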

`fct_inorder()`

This is an enhanced version of Ronak's answer which also handles the "Friday, Saturday, Sunday" case. It uses fct_inorder() from the forcats package, which orders factor levels by first appearance.

df2 %>% 
  group_by(id) %>% 
  mutate(daynum2 = 
           dayweek %>% 
           as.character() %>% 
           forcats::fct_inorder() %>% 
           as.integer()
         )

The output is the same as above.

Data

This is an extended data set which also includes the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1):

df2 <- data.frame(
  id = c(rep(1:2, each = 6), 2, rep(3, 3)),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3, 6, 7, 1),
  daynum = c(rep(1:3, each = 2, times = 2), 3, 1:3)
)
