
This is somewhat related to "Create sequential counter that restarts on a condition within panel data groups" and "data.table 'key indices' or 'group counter'", but not identical.

# data table:
    x y i d
 1: A B 1 1
 2: A B 1 1
 3: A C 2 2
 4: A D 3 3
 5: B A 1 4
 6: B A 1 4 
 7: C A 1 4
 8: C A 1 4 
 9: C B 2 5
10: C C 3 6
11: C C 3 6
12: C D 4 7

With `dt[, d := .GRP, by = .(x, y)]` the last column can be produced. Yet I am looking for a counter that restarts within every x group. See column i for the desired result.

Ben L
  • If you had `data.table(x = "A", y = c("A", "B", "A"))`, what would you expect `d` to be? – Martin Morgan Jun 04 '18 at 21:32
  • You can do arithmetic from your approach to get there: `tab[, g0 := .GRP, by=.(x,y)][, g := g0 - first(g0) + 1L, by=x][]` – Frank Jun 04 '18 at 21:34
  • Closing as a dupe but let me know if you disagree. I suggest reading Matt's answer, which shows approach mentioned in my last comment. – Frank Jun 04 '18 at 21:40
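Frank's comment above can be expanded into a runnable sketch. The column names `g0` and `g` follow his comment; note the arithmetic relies on the rows being grouped by x, so that `.GRP` assigns consecutive numbers to the (x, y) groups within each x block:

```r
library(data.table)

tab <- data.table(
  x = c("A","A","A","A","B","B","C","C","C","C","C","C"),
  y = c("B","B","C","D","A","A","A","A","B","C","C","D")
)

# .GRP numbers (x, y) groups in order of first appearance; subtracting the
# first group number seen within each x restarts the counter at 1
tab[, g0 := .GRP, by = .(x, y)][, g := g0 - first(g0) + 1L, by = x][]

tab$g
#> [1] 1 1 2 3 1 1 1 1 2 3 3 4
```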

3 Answers


You can achieve that with the rleid function on the y column, grouped by x. rleid is a run-length id: a counter that increases each time the value changes and stays the same otherwise.

library(data.table)
tab <- fread("
x y i d
A B 1 1
A B 1 1
A C 2 2
A D 3 3
B A 1 4
B A 1 4 
C A 1 4
C A 1 4 
C B 2 5
C C 3 6
C C 3 6
C D 4 7")

dt <- tab[, .(x, y, i)]
dt[, d:= rleid(y), by = .(x)]
dt
#>     x y i d
#>  1: A B 1 1
#>  2: A B 1 1
#>  3: A C 2 2
#>  4: A D 3 3
#>  5: B A 1 1
#>  6: B A 1 1
#>  7: C A 1 1
#>  8: C A 1 1
#>  9: C B 2 2
#> 10: C C 3 3
#> 11: C C 3 3
#> 12: C D 4 4

Created on 2018-06-03 by the reprex package (v0.2.0).
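One caveat worth noting (it is also what Martin Morgan's comment under the question asks about): rleid numbers runs, not distinct values, so if the same y reappears non-contiguously within an x group it gets a fresh id. A minimal illustration, with match() against unique() shown as a first-appearance alternative:

```r
library(data.table)

dt2 <- data.table(x = "A", y = c("B", "C", "B"))

# rleid: the second "B" starts a new run, so it gets a new id
dt2[, d := rleid(y), by = x]
dt2$d
#> [1] 1 2 3

# match against unique: ids assigned by first appearance instead
dt2[, d2 := match(y, unique(y)), by = x]
dt2$d2
#> [1] 1 2 1
```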

cderv

If your data is not ordered by y within x, you can do

df[, i := .SD[, rep(.GRP, .N), y]$V1, x]

or

df[, i := {ord <- order(y); rleid(y[ord])[order(ord)]}, x]

But if row order isn't important, it's faster to just order by y before computing i:

setorder(df, y) 
df[, i := rleid(y), x]
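Applied to interleaved values, the order-based variant above keeps equal y values together even when they are not adjacent (here the ids come out in sorted-y order, which is why both "C" rows get 2):

```r
library(data.table)

df <- data.table(x = "A", y = c("C", "B", "C"))

# sort y within x, compute run ids on the sorted values, then map back
# to the original row order via the inverse permutation order(ord)
df[, i := {ord <- order(y); rleid(y[ord])[order(ord)]}, x]
df$i
#> [1] 2 1 2
```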

Comparison

df <- df[sample(nrow(df), 1e7, T)]

grp <- function(df) df[, i := .SD[, rep(.GRP, .N), y]$V1, x]
rleid.alone <- function(df) 
  df[, i := rleid(y), x]
setord.rleid <- function(df) {
  setorder(df, y); df[, i := rleid(y), x]}
ord.rleid <- function(df){ 
    df[, i := {ord <- order(y); rleid(y[ord])[order(ord)]}, x]}
library(microbenchmark)
microbenchmark(
  rleid.alone(df),
  setord.rleid(df),
  ord.rleid(df),
  grp(df),
  times = 10
)

# Unit: milliseconds
# expr                   min        lq      mean    median        uq        max neval
# rleid.alone(df)   196.5973  201.1499  237.3837  234.6709  262.0397   292.0986    10
# setord.rleid(df)  215.6894  248.7814  285.1045  273.7231  316.5271   382.6173    10
# ord.rleid(df)    7610.9995 7767.9028 8137.2361 7820.5919 8055.2610 10034.9907    10
# grp(df)           336.3208  357.3206  439.5327  394.6960  517.3482   719.8893    10
IceCreamToucan

I find it helpful to think about the algorithm first and a data.table (or base R or dplyr) application second. It seems like there are several possible algorithms to create the desired counter. I have

f0 = function(x) match(x, unique(x))

or if the values of x are to somehow be sorted

f1 = function(x) match(x, sort(unique(x)))

These are different from indexes based on runs in x

f2 = function(x) { r = rle(x); r$values = seq_along(r$values); inverse.rle(r) }

From other answers we have

f3 = function(x) { o <- order(x); rleid(x[o])[order(o)] }

and data.table::rleid().

Here's a quick comparison of the different functions

> set.seed(123); x = sample(5, 20, TRUE)
> f0(x); f1(x); f2(x); f3(x); rleid(x)
 [1] 1 2 3 4 4 5 3 4 3 3 4 3 2 3 5 4 1 5 1 4
 [1] 2 4 3 5 5 1 3 5 3 3 5 3 4 3 1 5 2 1 2 5
 [1]  1  2  3  4  4  5  6  7  8  8  9 10 11 12 13 14 15 16 17 18
 [1] 2 4 3 5 5 1 3 5 3 3 5 3 4 3 1 5 2 1 2 5
 [1]  1  2  3  4  4  5  6  7  8  8  9 10 11 12 13 14 15 16 17 18

clarifying that implementations f0-f2 are each different, that f2() and rleid() agree at least on this input, and that f1() matches @Ryan's solution f3().
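The f1()/f3() agreement is not a coincidence of one seed: both assign each element the rank of its value among the sorted unique values. A quick property check (repeating the f1/f3 definitions above so the snippet is self-contained) supports this on several random inputs:

```r
library(data.table)

f1 <- function(x) match(x, sort(unique(x)))
f3 <- function(x) { o <- order(x); rleid(x[o])[order(o)] }

set.seed(42)
for (rep in 1:5) {
  x <- sample(letters[1:6], 50, replace = TRUE)
  stopifnot(identical(f1(x), f3(x)))
}
```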

Interestingly, the data provided in the question don't distinguish between these implementations (am I doing the data.table step right?)

> dt = tab[, .(x, y, i)]
> (dt[, .(y = y, f0 = f0(y), f1 = f1(y), f2 = f2(y), rleid = rleid(y)), by = .(x)])
    x y f0 f1 f2 rleid
 1: A B  1  1  1     1
 2: A B  1  1  1     1
 3: A C  2  2  2     2
 4: A D  3  3  3     3
 5: B A  1  1  1     1
 6: B A  1  1  1     1
 7: C A  1  1  1     1
 8: C A  1  1  1     1
 9: C B  2  2  2     2
10: C C  3  3  3     3
11: C C  3  3  3     3
12: C D  4  4  4     4

Having established the different algorithms, it may be interesting to compare performance to distinguish between alternative implementations.

> x = sample(100, 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), rleid(x))
Unit: microseconds
     expr      min        lq      mean   median        uq      max neval
    f0(x)  818.773  856.5275  926.5475  880.014  906.6040 5273.431   100
    f1(x) 1026.094 1084.1425 1112.1629 1101.626 1133.4100 1384.260   100
    f2(x) 1362.461 1428.8665 1595.0777 1622.881 1672.9835 4253.685   100
    f3(x)  823.653  862.5090  893.1710  894.268  914.1290 1050.157   100
 rleid(x)  236.590  245.0090  252.4963  251.158  257.7365  309.326   100
Martin Morgan