I find it helpful to think about the algorithm first and a data.table (or base R or dplyr) application second. There seem to be several possible algorithms for creating the desired counter. One option numbers the values in order of first appearance
f0 = function(x) match(x, unique(x))
or, if the values of x are instead to be numbered in sorted order,
f1 = function(x) match(x, sort(unique(x)))
These are different from indexes based on runs in x
f2 = function(x) { r = rle(x); r$values = seq_along(r$values); inverse.rle(r) }
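To make the difference between these three concrete, here is a small sketch on a made-up character vector (the input is purely illustrative, not from the question). The definitions are repeated so the block runs on its own in base R.

```r
# Definitions repeated from above so this block is self-contained.
f0 = function(x) match(x, unique(x))          # index by first appearance
f1 = function(x) match(x, sort(unique(x)))    # index by sorted value
f2 = function(x) { r = rle(x); r$values = seq_along(r$values); inverse.rle(r) }  # index by run

x = c("b", "a", "b", "b", "c", "a")   # made-up input for illustration

f0(x)   # 1 2 1 1 3 2  ("b" seen first, then "a", then "c")
f1(x)   # 2 1 2 2 3 1  ("a" < "b" < "c")
f2(x)   # 1 2 3 3 4 5  (a new index every time the value changes)
```

Note that f0() and f1() give a repeated value the same index wherever it occurs, whereas f2() starts a new index whenever a run ends.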
From other answers we have
f3 = function(x) { o <- order(x); rleid(x[o])[order(o)] }
and data.table::rleid().
Here's a quick comparison of the different functions
> library(data.table)  # for rleid(), which f3() also uses
> set.seed(123); x = sample(5, 20, TRUE)
> f0(x); f1(x); f2(x); f3(x); rleid(x)
[1] 1 2 3 4 4 5 3 4 3 3 4 3 2 3 5 4 1 5 1 4
[1] 2 4 3 5 5 1 3 5 3 3 5 3 4 3 1 5 2 1 2 5
[1] 1 2 3 4 4 5 6 7 8 8 9 10 11 12 13 14 15 16 17 18
[1] 2 4 3 5 5 1 3 5 3 3 5 3 4 3 1 5 2 1 2 5
[1] 1 2 3 4 4 5 6 7 8 8 9 10 11 12 13 14 15 16 17 18
clarifying that implementations f0-f2 are each different, that f2() and rleid() seem to be the same, at least on this input, and that f1() seems to be equivalent to @Ryan's solution f3().
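A single sample can only suggest equivalence. One way to probe it a little further is to compare the functions on many random inputs. The following is a minimal sketch, assuming data.table is installed; the helper agree() and its parameters are hypothetical names introduced here, and agreement on random integer vectors is evidence rather than proof.

```r
# Definitions repeated from above so this block is self-contained.
f1 = function(x) match(x, sort(unique(x)))
f2 = function(x) { r = rle(x); r$values = seq_along(r$values); inverse.rle(r) }

# Hypothetical helper: TRUE if g() and h() return equal results on
# n_trials random integer vectors of length 1 to 50.
agree = function(g, h, n_trials = 200) {
    all(vapply(seq_len(n_trials), function(i) {
        x = sample(10, sample(50, 1), TRUE)
        isTRUE(all.equal(g(x), h(x)))
    }, logical(1)))
}

set.seed(1)
f1_vs_f2 = agree(f1, f2)   # FALSE here would confirm that f1 and f2 differ
print(f1_vs_f2)

# The comparisons involving rleid() only run if data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
    rleid = data.table::rleid
    f3 = function(x) { o <- order(x); rleid(x[o])[order(o)] }
    print(agree(f2, rleid))   # does f2() match rleid()?
    print(agree(f1, f3))      # does f1() match f3()?
}
```

The check uses all.equal() rather than identical() so that a difference in storage type alone would not count as a disagreement.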
Interestingly, the data provided in the question don't distinguish between these implementations (am I doing the data.table step right?)
> dt = tab[, .(x, y, i)]
> (dt[, .(y = y, f0 = f0(y), f1 = f1(y), f2 = f2(y), rleid = rleid(y)), by = .(x)])
    x y f0 f1 f2 rleid
 1: A B  1  1  1     1
 2: A B  1  1  1     1
 3: A C  2  2  2     2
 4: A D  3  3  3     3
 5: B A  1  1  1     1
 6: B A  1  1  1     1
 7: C A  1  1  1     1
 8: C A  1  1  1     1
 9: C B  2  2  2     2
10: C C  3  3  3     3
11: C C  3  3  3     3
12: C D  4  4  4     4
Having established the different algorithms, it may be interesting to compare performance to distinguish between alternative implementations.
> library(microbenchmark)
> x = sample(100, 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), rleid(x))
Unit: microseconds
     expr      min        lq      mean   median        uq      max neval
    f0(x)  818.773  856.5275  926.5475  880.014  906.6040 5273.431   100
    f1(x) 1026.094 1084.1425 1112.1629 1101.626 1133.4100 1384.260   100
    f2(x) 1362.461 1428.8665 1595.0777 1622.881 1672.9835 4253.685   100
    f3(x)  823.653  862.5090  893.1710  894.268  914.1290 1050.157   100
 rleid(x)  236.590  245.0090  252.4963  251.158  257.7365  309.326   100