My data has a column of categorical values that repeat, with the repeats interspersed among other values. For each row, I would like to record which occurrence of its value that row is (the 1st, 2nd, ..., ith time the value appears). To add complexity, my data frame contains several ids, and the count has to be independent for each id.
Dummy Version of My Data
set.seed(123)
fruits <- sample(c("apple", "banana", "orange"), 30, replace = TRUE)
id <- c(rep(1, 10), rep(2, 10), rep(3, 10))
df <- as.data.frame(cbind(id, fruits))
> df
id fruits
1 1 orange
2 1 orange
3 1 orange
4 1 banana
5 1 orange
6 1 banana
7 1 banana
8 1 banana
9 1 orange
10 1 apple
11 2 banana
12 2 banana
13 2 apple
14 2 banana
15 2 orange
16 2 apple
17 2 orange
18 2 orange
19 2 apple
20 2 apple
21 3 apple
22 3 apple
23 3 orange
24 3 banana
25 3 orange
26 3 banana
27 3 apple
28 3 banana
29 3 orange
30 3 banana
The Output I'm Looking For
> df
id fruits fruit_repetitions_per_id
1 1 orange 1
2 1 orange 2
3 1 orange 3
4 1 banana 1
5 1 orange 4
6 1 banana 2
7 1 banana 3
8 1 banana 4
9 1 orange 5
10 1 apple 1
11 2 banana 1
12 2 banana 2
13 2 apple 1
14 2 banana 3
15 2 orange 1
16 2 apple 2
17 2 orange 2
18 2 orange 3
19 2 apple 3
20 2 apple 4
21 3 apple 1
22 3 apple 2
23 3 orange 1
24 3 banana 1
25 3 orange 2
26 3 banana 2
27 3 apple 3
28 3 banana 3
29 3 orange 3
30 3 banana 4
Attempts to Solve the Problem
One existing solution I found does pretty much what I want, but it doesn't address my additional need to count separately for each id.
Another one looked like exactly what I need, but I couldn't make it work and got a bunch of NAs instead:
with(df, ave(fruits, id,
             FUN = function(x) cumsum(!duplicated(x))))
[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Levels: apple banana orange
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, :
invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, 2L, 3L, 3L, 3L, 3L, :
invalid factor level, NA generated
Any ideas?
Thanks!
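For reference, the warnings above point at two separate issues. First, ave() writes the results of FUN back into a copy of its first argument, and here that argument is the factor fruits, so the integers are not valid factor levels and become NA. Second, cumsum(!duplicated(x)) counts how many distinct values have been seen so far, not how many times the current value has occurred. A minimal sketch of one possible approach, using only base R, is to pass an integer vector as the first argument and group by both id and fruits. The stringsAsFactors = TRUE argument is an assumption added here to reproduce the factor columns shown in the output above; on newer R versions the columns would otherwise be character.

```r
# Dummy data as in the question. stringsAsFactors = TRUE is an assumption
# made here so that both columns are factors, as in the output above.
set.seed(123)
fruits <- sample(c("apple", "banana", "orange"), 30, replace = TRUE)
id <- c(rep(1, 10), rep(2, 10), rep(3, 10))
df <- as.data.frame(cbind(id, fruits), stringsAsFactors = TRUE)

# Group by id AND fruits, then number the rows inside each group.
# The first argument is an integer vector, so ave() returns integers
# and never tries to store numbers in a factor.
df$fruit_repetitions_per_id <- ave(seq_along(df$fruits), df$id, df$fruits,
                                   FUN = seq_along)

head(df, 10)
```

Because the grouping is on the id/fruit pair, the counter restarts for each fruit within each id, which should correspond to the fruit_repetitions_per_id column in the desired output.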