4

I have a data.table with 1.6x10^8 records and I want to create a new character column based on the indicator column names for where there is a value of 1.

For example,

library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b"),
                             drugA=c(1,1,1,0,0),
                             drugB=c(0,1,1,1,0),
                             drugC=c(0,0,1,0,1))

   ID drugA drugB drugC
1:  a     1     0     0
2:  a     1     1     0
3:  a     1     1     1
4:  b     0     1     0
5:  b     0     0     1

### NOTE: I know the paste0(...,collapse) argument might be helpful in concatenating the drug names as an intermediate step
   ID drugA drugB drugC          exposure
1:  a     1     0     0             drugA
2:  a     1     1     0       drugA+drugB
3:  a     1     1     1 drugA+drugB+drugC
4:  b     0     1     0             drugB
5:  b     0     0     1             drugC

I want this to be as robust as clean as possible and solely rely on data.table syntax and/or other useful packages/functions (e.g. dcast) ; I want to steer away from creating an extensive user-defined function because given my data.table size, it would take extremely long to run.

I've looked at other posts but I was unable to find something similar to my situation and desired output.

Any help would be greatly appreciated.

theneil
  • 488
  • 1
  • 4
  • 14

2 Answers2

3

We can do a group by sequence of rows, specify the .SDcols as the 'drug' column, convert the Subset of Data.table (.SD) to logical, use that to subset the column names and paste them together

library(data.table)
DT[,  exposure := paste(names(.SD)[as.logical(.SD)], collapse= '+'), 
       1:nrow(DT), .SDcols = drugA:drugC]
DT
#   ID drugA drugB drugC          exposure
#1:  a     1     0     0             drugA
#2:  a     1     1     0       drugA+drugB
#3:  a     1     1     1 drugA+drugB+drugC
#4:  b     0     1     0             drugB
#5:  b     0     0     1             drugC

Or instead of grouping by rows, we can loop over the columns, change the values to column names and then paste with do.call and remove the NA elements with gsub

DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste, 
   c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD, 
   seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @theneil i added one more optioin that should be faster than the first – akrun Apr 03 '20 at 21:34
  • 1
    your first code post took ~25 minutes to run and your second code post tookd ~10 minutes. This was on 1.6x10^8 rows of data. Thanks – theneil Apr 04 '20 at 00:16
3
library('data.table')
DT[, id := .I]
df <- melt(DT, id.vars = 'id', measure.vars = c("drugA", "drugB", "drugC"))
df[value == 1, expose := 'exposure']
df[value == 0, expose := 'no_exposure'][, value := NULL]
df <- dcast(df, id ~ expose, fun.aggregate = function(x) paste0(x, collapse = "+"), value.var = 'variable')
DT[df, on = 'id'][, id := NULL][]
#    ID drugA drugB drugC          exposure no_exposure
# 1:  a     1     0     0             drugA drugB+drugC
# 2:  a     1     1     0       drugA+drugB       drugC
# 3:  a     1     1     1 drugA+drugB+drugC            
# 4:  b     0     1     0             drugB drugA+drugC
# 5:  b     0     0     1             drugC drugA+drugB
Sathish
  • 12,453
  • 3
  • 41
  • 59