nth group labelling (count only graduates with new group)

Question

Say, I have a df:

df <- data.table(user = c('a', 'a', 'b')
                 ); df

user
 a
 a
 b

I am looking to create a new column, group_id, at user level:

user group_id
 a        1
 a        1
 b        2

Is there a fast / scalable way? I can do this with a join (and understand data.table is fast with joins) but wonder if there is a simpler and faster method. Assume I have circa 1e8 rows and the method is to be applied to multiple columns in the group by (in example presented there is only 1 i.e. user)

I have started with:

df[, step_1 := as.integer(duplicated(df))]; df

user step_1
 a      0
 a      1
 b      0

but don't know what is next nor am I sure if this is even the right direction. Thank you.

Uwe · Answer 1 · 2021-10-19T21:39:40.177

2

data.table’s special symbol .GRP will also work with multiple grouping columns

df[, group_id := .GRP, by = user][]

user group_id
   a        1
   a        1
   b        2

Alternatively, the rleid() function can be used but this requires that the dataset is ordered by the very same columns:

df[order(user), group_id := rleid(user)][]

   user group_id
1:    a        1
2:    a        1
3:    b        2

edited Oct 19 '21 at 21:39

answered Oct 15 '21 at 16:48

Uwe

41,420
11
90
134

Thank you. However, `rowid` does not give the result as presented. it would give group_id = c(1,2,1) because it does an intra group row count Am I mistaken? – Sweepy Dodo Oct 15 '21 at 17:03
@SweepyDodo Yes, you are right, `rowid()` was the wrong function. I had meant the `.GRP` special symbol. – Uwe Oct 19 '21 at 21:44
Not a problem. Still appreciate your reply! A fellow stacker redirected me to a [thread](https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-group-number/69617625#69617625) where I have provided some benchmarking if you are interested p.s. yes, rleid is tricky as one has to order first – Sweepy Dodo Oct 20 '21 at 10:01

score 1 · Answer 2 · answered Oct 15 '21 at 15:15

1

Convert user to factor. Then convert that to integer to extract the underlying codes that factor uses.

df[, group_id := as.integer(factor(user))]
df
##    user group_id
## 1:    a        1
## 2:    a        1
## 3:    b        2

answered Oct 15 '21 at 15:15

G. Grothendieck

254,981
17
203
341

Grothendireck Thank you for your prompt reply. Yours does work. However, what if there are multiple columns to be grouped by? – Sweepy Dodo Oct 15 '21 at 15:21
1

paste them together `group_id := as.integer(factor(paste(user, user2)))` – G. Grothendieck Oct 15 '21 at 15:24
Thank you. Silly of me to not have thought of pasting. I'm benchmarking possible solutions atm. Will comment again later – Sweepy Dodo Oct 15 '21 at 17:00
I have just behcnmarked this method. While it works and is read-able it does not scale well with large data. Still, it was good to learn as I had rarely used `factor` before. Thank you! p.s. took 5.27 secs on 1e7 rows – Sweepy Dodo Oct 18 '21 at 09:16
Some kind soul linked me to an older thread where I have now provided a benchmark result [here](https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-group-number/69617625#69617625) – Sweepy Dodo Oct 18 '21 at 14:18

score 1 · Accepted Answer · answered Oct 15 '21 at 17:46

1

We may also use match

df[, group_id := match(user, unique(user))]

answered Oct 15 '21 at 17:46

akrun

874,273
37
540
662

1

Thanks for this. I too benchmarked this and took 0.91 sec on 1e7 rows Who would have thought a ase R solution would be this quick – Sweepy Dodo Oct 18 '21 at 09:20

nth group labelling (count only graduates with new group)

3 Answers3