Cumulative count for interspersed categorical values in a dataframe, independently for each id

Question

My data has a column that contains redundant categorical values that are interspersed. I would like to indicate, in each row, the ith time each unique value appears. To add complexity, I have different ids in my dataframe, and the count has to be independent for each id.

Dummy Version of My Data

set.seed(123)
fruits <- sample(c("apple", "banana", "orange"), 30, replace = TRUE)
id <- c(rep(1, 10), rep(2, 10), rep(3, 10))
df <- as.data.frame(cbind(id, fruits))

> df
   id fruits
1   1 orange
2   1 orange
3   1 orange
4   1 banana
5   1 orange
6   1 banana
7   1 banana
8   1 banana
9   1 orange
10  1  apple
11  2 banana
12  2 banana
13  2  apple
14  2 banana
15  2 orange
16  2  apple
17  2 orange
18  2 orange
19  2  apple
20  2  apple
21  3  apple
22  3  apple
23  3 orange
24  3 banana
25  3 orange
26  3 banana
27  3  apple
28  3 banana
29  3 orange
30  3 banana

The Output I'm Looking For

> df
   id fruits   fruit_repetitions_per_id  
1   1 orange   1
2   1 orange   2
3   1 orange   3
4   1 banana   1
5   1 orange   4
6   1 banana   2
7   1 banana   3
8   1 banana   4
9   1 orange   5
10  1  apple   1
11  2 banana   1
12  2 banana   2
13  2  apple   1
14  2 banana   3
15  2 orange   1
16  2  apple   2
17  2 orange   2
18  2 orange   3
19  2  apple   3
20  2  apple   4
21  3  apple   1
22  3  apple   2
23  3 orange   1
24  3 banana   1
25  3 orange   2
26  3 banana   2
27  3  apple   3
28  3 banana   3
29  3 orange   3
30  3 banana   4

Attempts to Solve the Problem

This one is pretty much what I want, but I have my additional need to count/mark separately for each id, which that solution doesn't address.
This one is exactly what I need, but couldn't make it work and got a bunch of NAs instead:

with(df, ave(fruits, id,
             FUN = function(x) cumsum(!duplicated(x))))

[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Levels: apple banana orange
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 1L, 2L, 2L, 2L, 2L,  :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 2L, 2L, 3L, 3L, 3L,  :
  invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 2L, 3L, 3L, 3L, 3L,  :
  invalid factor level, NA generated

Any ideas?

Thanks!

@jogo, your answer is the simplest and most concise. Make it an answer. — Emman, Jul 23 '19 at 12:22
The question is closed because it is a duplicate (no new answer is possible). A similar answer is for the question https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame — jogo, Jul 23 '19 at 12:33
@jogo, from testing your solution, there's seems to be a glitch when it comes to NA values. That is, if `df$fruits` contains some `NA` values for the same `df$id`, `ave()` would mark both NAs as the first occurrence. This unwanted behavior is avoided when using @IceCreamToucan's or @akrun's solutions proposed here. I'm not sure why NA breaks `ave()` but not the other methods. Ideas? — Emman, Aug 13 '19 at 05:38

score 2 · Answer 1 · answered Jul 22 '19 at 13:08

2

You can use data.table::rowid

library(data.table)
setDT(df)

df[, new_col := rowid(id, fruits)]

answered Jul 22 '19 at 13:08

IceCreamToucan

28,083
2
22
38

score 1 · Answer 2 · answered Jul 22 '19 at 13:05

1

An option would be

library(dplyr)
df %>%
    group_by(id, fruits) %>%
    mutate(fruits_rep = row_number())

answered Jul 22 '19 at 13:05

akrun

874,273
37
540
662

Thanks, @akrun. In my real dataset, I'm trying to use `mutate_at()` instead of `mutate()`, to update an existing column. However, I'm not sure how to write the syntax correctly. I've tried wrapping with `vars()` such as `mutate_at(vars(fruits) = row_number())`, but it didn't work. Nor did `mutate_at(vars(fruits) , row_number())`. Any idea? Thx – Emman Aug 13 '19 at 06:15
@Emman Would this be `mutate(fruits = row_number())` – akrun Aug 13 '19 at 13:00

score 1 · Answer 3 · answered Jul 22 '19 at 13:08

You can do this in base as you have already tried with ave:

df$fruit_repetitions_per_id  <- ave(rep(1,nrow(df)), df[c("id", "fruits")], FUN=cumsum)
df
#   id fruits fruit_repetitions_per_id
#1   1 orange                        1
#2   1 orange                        2
#3   1 orange                        3
#4   1 banana                        1
#5   1 orange                        4
#6   1 banana                        2
#7   1 banana                        3
#8   1 banana                        4
#9   1 orange                        5
#10  1  apple                        1
#11  2 banana                        1
#12  2 banana                        2
#13  2  apple                        1
#14  2 banana                        3
#15  2 orange                        1
#16  2  apple                        2
#17  2 orange                        2
#18  2 orange                        3
#19  2  apple                        3
#20  2  apple                        4
#21  3  apple                        1
#22  3  apple                        2
#23  3 orange                        1
#24  3 banana                        1
#25  3 orange                        2
#26  3 banana                        2
#27  3  apple                        3
#28  3 banana                        3
#29  3 orange                        3
#30  3 banana                        4

Cumulative count for interspersed categorical values in a dataframe, independently for each id

Dummy Version of My Data

The Output I'm Looking For

Attempts to Solve the Problem

3 Answers3