Assign unique ID per multiple columns of data table

Question

I would like to assign unique IDs to rows of a data table per multiple column values. Let's consider a simple example:

library(data.table)       
DT = data.table(a=c(4,2,NA,2,NA), b=c("a","b","c","b","c"), c=1:5)

    a b c
1:  4 a 1
2:  2 b 2
3: NA c 3
4:  2 b 4
5: NA c 5

I'd like to generate IDs based on columns a and b and expect to get three IDs where 2nd and 4th row IDs are identical and 3rd and 5th rows have the same ID as well.

I have seen two solutions but each are slightly incomplete:

1) Solution one requires sorted data table which is very cumbersome if we need to generate IDs per many columns (in my real application, IDs are generated based on about ten columns). Can we replace cumsum function so no sorting is required?

DT$ID1 <- cumsum(!duplicated(DT[,1:2]))

2) Solution two ignores NA values; while I'd like to include NAs and assign a group ID to them

DT <- transform(DT, ID2 = as.numeric(interaction(a,b, drop=TRUE)))

I appreciate any suggestion on how to modify either of the solutions to generate the Expected_ID shown below.

    a b c ID1 ID2 Expected_ID
1:  4 a 1   1   1           1
2:  2 b 2   2   2           2
3: NA c 3   3  NA           3
4:  2 b 4   3   2           2
5: NA c 5   3  NA           3

Frank · Accepted Answer · 2018-10-04T18:08:11.580

The idiomatic way:

DT[, g := .GRP, by=.(a,b)]

    a b c g
1:  4 a 1 1
2:  2 b 2 2
3: NA c 3 3
4:  2 b 4 2
5: NA c 5 3

There's reason to believe this won't be fast, but it turns out it's not too bad compared to competing approaches:

nv = 10
nu = 3
nr = 1e6

library(data.table)
set.seed(1)
DT = do.call(CJ, rep(list(seq_len(nu)), nv))[sample(1:.N, nr, replace=TRUE)]

cols = copy(names(DT))

# "idiomatic" .GRP
system.time(DT[, g := .GRP, by=cols])
#    user  system elapsed 
#    0.23    0.02    0.25 

# sort and count runs
oi = as.call(lapply(c("order", cols), as.name))
system.time(DT[eval(oi), go := rleidv(.SD, cols)])
#    user  system elapsed 
#     0.3     0.0     0.3

# paste 'em
system.time(DT[, gp := match(p <- do.call(paste, c(.SD, list(sep="_"))), unique(p)), .SDcols=cols])
#    user  system elapsed 
#    5.26    0.06    5.32 

# paste 'em, fact'em (@akrun's answer)
system.time(DT[, gpf := as.integer(factor(p <- do.call(paste, c(.SD, list(sep="_"))), levels = unique(p))), .SDcols=cols])
#    user  system elapsed 
#    4.74    0.08    4.82 

# check
identical(DT$g, DT$gp); identical(DT$g, DT$gpf)
uniqueN(DT, "g") == uniqueN(DT, c("g", "go"))

The rleidv way creates different group numbers, but effects the same grouping.

Increasing the size of the problem to nr = 5e7 raised the time to 8s for the .GRP approach; 20s for the rleidv way; and led R to hang for the others on my system.

For anyone interested, more approaches can be found in the R FAQ How to create a consecutive index based on a grouping variable in a dataframe

I looked for a dupe here, but couldn't find any... https://stackoverflow.com/search?page=1&tab=Votes&q=%5br%5d%20is%3aa%20%22.grp%22 Of course there are countless dupes for the as.integer-factor way ... another common variant being `DT[, match(p <- paste(a,b), unique(p))]` — Frank, Oct 04 '18 at 17:19
@MichaelChirico Yeah, I guess it's not fast since `:=` and `.GRP` are not covered by GForce. I keep expecting a `grpid` function to come along that can substitute for the code I have here (to go with rowid and rleid). Anyway, I'll try out a benchmark and add if it's interesting. — Frank, Oct 04 '18 at 17:43
@MichaelChirico Ok, added. It's good compared to alternatives, at least. — Frank, Oct 04 '18 at 18:09

score 0 · Answer 2 · answered Oct 04 '18 at 17:15

0

We can use

DT[, Expected_ID := as.numeric(factor(paste(a, b), levels = unique(paste(a, b))))]

answered Oct 04 '18 at 17:15

akrun

874,273
37
540
662

Assign unique ID per multiple columns of data table

2 Answers2

Related