I would like to assign unique IDs to rows of a data table per multiple column values. Let's consider a simple example:
library(data.table)
DT = data.table(a=c(4,2,NA,2,NA), b=c("a","b","c","b","c"), c=1:5)
a b c
1: 4 a 1
2: 2 b 2
3: NA c 3
4: 2 b 4
5: NA c 5
I'd like to generate IDs based on columns a and b and expect to get three IDs where 2nd and 4th row IDs are identical and 3rd and 5th rows have the same ID as well.
I have seen two solutions but each are slightly incomplete:
1) Solution one requires sorted data table which is very cumbersome if we need to generate IDs per many columns (in my real application, IDs are generated based on about ten columns). Can we replace cumsum function so no sorting is required?
DT$ID1 <- cumsum(!duplicated(DT[,1:2]))
2) Solution two ignores NA values; while I'd like to include NAs and assign a group ID to them
DT <- transform(DT, ID2 = as.numeric(interaction(a,b, drop=TRUE)))
I appreciate any suggestion on how to modify either of the solutions to generate the Expected_ID shown below.
a b c ID1 ID2 Expected_ID
1: 4 a 1 1 1 1
2: 2 b 2 2 2 2
3: NA c 3 3 NA 3
4: 2 b 4 3 2 2
5: NA c 5 3 NA 3