data.table: Create new character column based on indicator columns values and names

Question

I have a data.table with 1.6x10^8 records and I want to create a new character column based on the indicator column names for where there is a value of 1.

For example,

library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b"),
                             drugA=c(1,1,1,0,0),
                             drugB=c(0,1,1,1,0),
                             drugC=c(0,0,1,0,1))

   ID drugA drugB drugC
1:  a     1     0     0
2:  a     1     1     0
3:  a     1     1     1
4:  b     0     1     0
5:  b     0     0     1

### NOTE: I know the paste0(...,collapse) argument might be helpful in concatenating the drug names as an intermediate step
   ID drugA drugB drugC          exposure
1:  a     1     0     0             drugA
2:  a     1     1     0       drugA+drugB
3:  a     1     1     1 drugA+drugB+drugC
4:  b     0     1     0             drugB
5:  b     0     0     1             drugC

I want this to be as robust as clean as possible and solely rely on data.table syntax and/or other useful packages/functions (e.g. dcast) ; I want to steer away from creating an extensive user-defined function because given my data.table size, it would take extremely long to run.

I've looked at other posts but I was unable to find something similar to my situation and desired output.

Any help would be greatly appreciated.

akrun · Accepted Answer · 2020-04-03T21:34:05.100

We can do a group by sequence of rows, specify the .SDcols as the 'drug' column, convert the Subset of Data.table (.SD) to logical, use that to subset the column names and paste them together

library(data.table)
DT[,  exposure := paste(names(.SD)[as.logical(.SD)], collapse= '+'), 
       1:nrow(DT), .SDcols = drugA:drugC]
DT
#   ID drugA drugB drugC          exposure
#1:  a     1     0     0             drugA
#2:  a     1     1     0       drugA+drugB
#3:  a     1     1     1 drugA+drugB+drugC
#4:  b     0     1     0             drugB
#5:  b     0     0     1             drugC

Or instead of grouping by rows, we can loop over the columns, change the values to column names and then paste with do.call and remove the NA elements with gsub

DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste, 
   c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD, 
   seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]

@theneil i added one more optioin that should be faster than the first — akrun, Apr 03 '20 at 21:34
your first code post took ~25 minutes to run and your second code post tookd ~10 minutes. This was on 1.6x10^8 rows of data. Thanks — theneil, Apr 04 '20 at 00:16

score 3 · Answer 2 · answered Apr 03 '20 at 21:37

library('data.table')
DT[, id := .I]
df <- melt(DT, id.vars = 'id', measure.vars = c("drugA", "drugB", "drugC"))
df[value == 1, expose := 'exposure']
df[value == 0, expose := 'no_exposure'][, value := NULL]
df <- dcast(df, id ~ expose, fun.aggregate = function(x) paste0(x, collapse = "+"), value.var = 'variable')
DT[df, on = 'id'][, id := NULL][]
#    ID drugA drugB drugC          exposure no_exposure
# 1:  a     1     0     0             drugA drugB+drugC
# 2:  a     1     1     0       drugA+drugB       drugC
# 3:  a     1     1     1 drugA+drugB+drugC            
# 4:  b     0     1     0             drugB drugA+drugC
# 5:  b     0     0     1             drugC drugA+drugB

data.table: Create new character column based on indicator columns values and names

2 Answers2