2

Say, I have a df:

df <- data.table(user = c('a', 'a', 'b')
                 ); df

user
 a
 a
 b

I am looking to create a new column, group_id, at user level:

user group_id
 a        1
 a        1
 b        2

Is there a fast / scalable way? I can do this with a join (and understand data.table is fast with joins) but wonder if there is a simpler and faster method. Assume I have circa 1e8 rows and the method is to be applied to multiple columns in the group by (in example presented there is only 1 i.e. user)

I have started with:

df[, step_1 := as.integer(duplicated(df))]; df

user step_1
 a      0
 a      1
 b      0

but don't know what is next nor am I sure if this is even the right direction. Thank you.

Sweepy Dodo
  • 1,761
  • 9
  • 15

3 Answers3

2

’s special symbol .GRP will also work with multiple grouping columns

df[, group_id := .GRP, by = user][]
user group_id
   a        1
   a        1
   b        2

Alternatively, the rleid() function can be used but this requires that the dataset is ordered by the very same columns:

df[order(user), group_id := rleid(user)][] 
   user group_id
1:    a        1
2:    a        1
3:    b        2
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thank you. However, `rowid` does not give the result as presented. it would give group_id = c(1,2,1) because it does an intra group row count Am I mistaken? – Sweepy Dodo Oct 15 '21 at 17:03
  • @SweepyDodo Yes, you are right, `rowid()` was the wrong function. I had meant the `.GRP` special symbol. – Uwe Oct 19 '21 at 21:44
  • Not a problem. Still appreciate your reply! A fellow stacker redirected me to a [thread](https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-group-number/69617625#69617625) where I have provided some benchmarking if you are interested p.s. yes, rleid is tricky as one has to order first – Sweepy Dodo Oct 20 '21 at 10:01
1

Convert user to factor. Then convert that to integer to extract the underlying codes that factor uses.

df[, group_id := as.integer(factor(user))]
df
##    user group_id
## 1:    a        1
## 2:    a        1
## 3:    b        2
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Grothendireck Thank you for your prompt reply. Yours does work. However, what if there are multiple columns to be grouped by? – Sweepy Dodo Oct 15 '21 at 15:21
  • 1
    paste them together `group_id := as.integer(factor(paste(user, user2)))` – G. Grothendieck Oct 15 '21 at 15:24
  • Thank you. Silly of me to not have thought of pasting. I'm benchmarking possible solutions atm. Will comment again later – Sweepy Dodo Oct 15 '21 at 17:00
  • I have just behcnmarked this method. While it works and is read-able it does not scale well with large data. Still, it was good to learn as I had rarely used `factor` before. Thank you! p.s. took 5.27 secs on 1e7 rows – Sweepy Dodo Oct 18 '21 at 09:16
  • Some kind soul linked me to an older thread where I have now provided a benchmark result [here](https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-group-number/69617625#69617625) – Sweepy Dodo Oct 18 '21 at 14:18
1

We may also use match

df[, group_id := match(user, unique(user))]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thanks for this. I too benchmarked this and took 0.91 sec on 1e7 rows Who would have thought a ase R solution would be this quick – Sweepy Dodo Oct 18 '21 at 09:20