
I have a dataset with multiple observations nested within individuals. This example dataset includes columns for id and for day of the week (dayweek, 1-7). I have observations from 3 days from each individual. So one individual might have only submitted reports for Sun/Wed/Thu (1, 4, 5), and the other might have submitted reports for Sun/Mon/Tue (1, 2, 3), as in this example:

df <- data.frame(
  id = c(rep(1:2, each = 6),2),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3)
)

I want to set up a column that marks each individual's first, second, and third day, like this:

df2 <- data.frame(
  id = c(rep(1:2, each = 6),2),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3),
  daynum = c(rep(1:3, each = 2, times = 2), 3)
)

I tried using

df %>% group_indices(id, dayweek) 

but this produces a new id for each individual-day combination. What's a good way to do this?

Thanks in advance!

Aziggy
  • What if the days were Friday, Saturday, Sunday: then you’d have 6,7,1 but the 6 would be the first day and the 1 would be the third, right? Are the rows already in date order, so that the first row for an id would get daynum = 1? – divibisan Apr 21 '19 at 03:50
  • Hi, correct: in this case I would like to recode 6 to 1, 7 to 2, and 1 to 3. Yes, the rows are in order but there is a different number of observations for each individual and each day. So they might have 4 observations for Sat, 2 for Sun, etc. – Aziggy Apr 21 '19 at 03:52

3 Answers

dplyr

Using cumsum and !duplicated with dplyr:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(daynum = cumsum(!duplicated(dayweek)))


# A tibble: 13 x 3
# Groups:   id [2]
      id dayweek daynum
   <dbl>   <dbl>  <int>
 1     1       1      1
 2     1       1      1
 3     1       4      2
 4     1       4      2
 5     1       5      3
 6     1       5      3
 7     2       1      1
 8     2       1      1
 9     2       2      2
10     2       2      2
11     2       3      3
12     2       3      3
13     2       3      3
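To see why this works, here is a minimal sketch of the same idea on a single hypothetical vector (the values in `x` are made up for illustration, using the Fri/Sat/Sun pattern from the comments). `!duplicated(x)` flags the first occurrence of each day, and `cumsum()` turns those flags into a running count of distinct days seen so far:

```r
# Hypothetical day-of-week values for one individual, in observation order
x <- c(6, 6, 7, 1, 1)

# TRUE wherever a day appears for the first time
first_seen <- !duplicated(x)
print(first_seen)  # TRUE FALSE TRUE TRUE FALSE

# Running count of first occurrences = ordinal number of the day
daynum <- cumsum(first_seen)
print(daynum)      # 1 1 2 3 3
```

Because the count depends only on order of appearance, not on the numeric value of `dayweek`, the wrap-around week is numbered correctly as long as the rows are in order.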

tapply from base R

unlist(tapply(df$dayweek, df$id, function(x) cumsum(!duplicated(x))))

 1  1  2  2  3  3  1  1  2  2  3  3  3 
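One caveat worth sketching: `unlist(tapply(...))` returns values grouped by `id`, not in the original row order, so it only lines up with the data frame when the rows are already sorted by `id` (as they are in the question). The toy data and the helper name `count_days` below are assumptions for illustration; `ave()` is shown as a base R alternative that keeps row order:

```r
# Toy data with ids deliberately interleaved (not sorted by id)
toy <- data.frame(id = c(2, 1, 2, 1), dayweek = c(3, 5, 3, 6))

# Hypothetical helper: ordinal number of each distinct day, by first appearance
count_days <- function(x) cumsum(!duplicated(x))

# tapply() groups by id, so the result comes back in group order (id 1, then id 2)
by_group <- unlist(tapply(toy$dayweek, toy$id, count_days), use.names = FALSE)
print(by_group)  # 1 2 1 1

# ave() writes each group's result back into the original row positions
by_row <- ave(toy$dayweek, toy$id, FUN = count_days)
print(by_row)    # 1 1 1 2
```

If the data might not be sorted by `id`, assigning the `ave()` result to a new column is the safer of the two.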
cropgen
  • This also handles the "Friday, Saturday, Sunday" case (`dayweek` 6, 7, 1) well. – Uwe Apr 21 '19 at 07:42

We could group_by id and create a unique id for each dayweek:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(daynum = as.integer(factor(dayweek, levels = unique(dayweek))))

#      id dayweek daynum
#   <dbl>   <dbl>  <int>
# 1     1       1      1
# 2     1       1      1
# 3     1       4      2
# 4     1       4      2
# 5     1       5      3
# 6     1       5      3
# 7     2       1      1
# 8     2       1      1
# 9     2       2      2
#10     2       2      2
#11     2       3      3
#12     2       3      3
#13     2       3      3

In base R, we can use ave to do the same:

with(df, ave(dayweek, id, FUN = function(x) 
         as.integer(factor(x, levels = unique(x)))))
#[1] 1 1 2 2 3 3 1 1 2 2 3 3 3
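The `levels = unique(x)` argument is what makes this work for a week that wraps around. As a small sketch on a made-up vector: by default `factor()` sorts its levels, so the integer codes follow the numeric value of the day, whereas passing `unique(x)` orders the levels by first appearance:

```r
# Hypothetical Fri/Sat/Sun observations for one individual
x <- c(6, 6, 7, 1)

# Default: levels are sorted (1, 6, 7), so Sunday gets code 1
sorted_codes <- as.integer(factor(x))
print(sorted_codes)   # 2 2 3 1

# Levels in order of first appearance (6, 7, 1), so Friday gets code 1
ordered_codes <- as.integer(factor(x, levels = unique(x)))
print(ordered_codes)  # 1 1 2 3
```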
Ronak Shah
  • For the "Friday, Saturday, Sunday" case (`dayweek` 6, 7, 1), this will return 2, 3, 1 while the OP expects 1, 2, 3 according to the comments. – Uwe Apr 21 '19 at 07:40
  • @Uwe Thanks, updated the answer to handle that case. – Ronak Shah Apr 21 '19 at 08:11
  • Interesting to see the use of `unique`. Somehow the `!duplicated` can easily be taken to mean `unique` but applying that becomes tricky. – NelsonGon Jun 21 '19 at 12:27
  • @NelsonGon Yes, here the OP wanted to follow the correct order based on when the `dayweek` was observed hence, the use of `unique`. – Ronak Shah Jun 21 '19 at 12:33

According to the OP's comment, the rows are already in order.

Given that, here are two approaches that also handle the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1) mentioned in the comments.

  1. rleid()
  2. fct_inorder()

rleid()

This uses the rleid() function from the data.table package:

library(dplyr)
df2 %>% 
  group_by(id) %>% 
  mutate(daynum2 = data.table::rleid(dayweek))

      id dayweek daynum daynum2
   <dbl>   <dbl>  <dbl>   <int>
 1     1       1      1       1
 2     1       1      1       1
 3     1       4      2       2
 4     1       4      2       2
 5     1       5      3       3
 6     1       5      3       3
 7     2       1      1       1
 8     2       1      1       1
 9     2       2      2       2
10     2       2      2       2
11     2       3      3       3
12     2       3      3       3
13     2       3      3       3
14     3       6      1       1
15     3       7      2       2
16     3       1      3       3

Note that this uses the extended data set from the Data section at the end of this answer, which also covers the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1).
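One difference from the `cumsum(!duplicated())` approach may be worth keeping in mind: a run-length id numbers consecutive runs, not distinct values. The sketch below uses base R's `rle()` as a stand-in so it runs without data.table; `run_id` is a hypothetical helper that I believe matches `rleid()` for a single atomic vector, and the input vector is made up:

```r
# Hypothetical base R stand-in for a run-length id on one vector
run_id <- function(x) {
  r <- rle(x)                             # lengths of consecutive runs
  rep(seq_along(r$lengths), r$lengths)    # number each run, repeated per element
}

# A day (6) that reappears after other days have been observed
x <- c(6, 6, 7, 1, 6)

runs <- run_id(x)
print(runs)      # 1 1 2 3 4

distinct <- cumsum(!duplicated(x))
print(distinct)  # 1 1 2 3 3
```

With only three contiguous days per individual, as in the question, the two give the same numbering; they diverge only if a day reappears after a different day.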

fct_inorder()

This is an enhanced version of Ronak's answer that also handles the "Friday, Saturday, Sunday" case. It uses fct_inorder() from the forcats package, which reorders factor levels by first appearance.

df2 %>% 
  group_by(id) %>% 
  mutate(daynum2 = 
           dayweek %>% 
           as.character() %>% 
           forcats::fct_inorder() %>% 
           as.integer()
         ) 

The output is the same as above.

Data

This is an extended data set that also includes the "Friday, Saturday, Sunday" case (dayweek 6, 7, 1):

df2 <- data.frame(
  id = c(rep(1:2, each = 6), 2, rep(3, 3)),
  dayweek = c(rep(c(1, 4, 5), each = 2),rep(c(1, 2, 3), each = 2), 3, 6, 7, 1),
  daynum = c(rep(1:3, each = 2, times = 2), 3, 1:3)
)
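As a quick sanity check, the expected `daynum` column in this extended data set can be reproduced with base R alone. This is only a sketch: it repeats the `df2` definition so it is self-contained, and `computed` is a name introduced here for illustration:

```r
# Extended data set, repeated so the check runs on its own
df2 <- data.frame(
  id = c(rep(1:2, each = 6), 2, rep(3, 3)),
  dayweek = c(rep(c(1, 4, 5), each = 2), rep(c(1, 2, 3), each = 2), 3, 6, 7, 1),
  daynum = c(rep(1:3, each = 2, times = 2), 3, 1:3)
)

# Per-id ordinal day number, by order of first appearance
computed <- ave(df2$dayweek, df2$id, FUN = function(x) cumsum(!duplicated(x)))

# TRUE if the computed numbering matches the hand-written daynum column
print(all(computed == df2$daynum))
```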
Uwe