2

I have a data.table with many numbered columns. As a simpler example, I have this:

dat <- data.table(cbind(col1=sample(1:5,10,replace=T),
                            col2=sample(1:5,10,replace=T),
                            col3=sample(1:5,10,replace=T),
                            col4=sample(1:5,10,replace=T)),
oneMoreCol='a')

I want to create a new column as follows: In each row, we add the values in columns from among col1-col4 if the value is not NA or 1.

My current code for this has two for-loops which is clearly not the way to do it:

for(i in 1:nrow(dat)){
  dat[i,'sumCol':={temp=0;
  for(j in 1:4){if(!is.na(dat[i,paste0('col',j),with=F])&
                   dat[i,paste0('col',j),with=F]!=1
                                 ){temp=temp+dat[i,paste0('col',j),with=F]}};
  temp}]}

I would appreciate any advice on how to remove this for-loops. My code is running on a bigger data.table and it takes a long time to run.

Jaap
  • 81,064
  • 34
  • 182
  • 193
lonestar79
  • 33
  • 3
  • 2
    In addition to `rowSums`, some other alternatives here: [Summing across rows of a data.table for specific columns](https://stackoverflow.com/questions/21857679/summing-across-rows-of-a-data-table-for-specific-columns), i.e. `Reduce` ("_it will not convert to matrix as `rowSums` does_") and working in long format after `melt` (often more convenient). – Henrik Jun 18 '20 at 14:49

1 Answers1

2

A possible solution:

dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]

which gives:

> dat
    col1 col2 col3 col4 oneMoreCol sumCol
 1:    4    5    5    3          a     17
 2:    4    5   NA    5          a     14
 3:    2    3    4    3          a     12
 4:    1    2    3    4          a      9
 5:    4    3   NA    5          a     12
 6:    2    2    1    4          a      8
 7:   NA    2   NA    5          a      7
 8:    4    2    2    4          a     12
 9:    4    1    5    4          a     13
10:    2    1    5    1          a      7

Used data:

set.seed(20200618)
dat <- data.table(cbind(col1=sample(c(NA, 1:5),10,replace=T),
                        col2=sample(1:5,10,replace=T),
                        col3=sample(c(1:5,NA),10,replace=T),
                        col4=sample(1:5,10,replace=T)),
                  oneMoreCol='a')
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • AFAIK, `is.na` and logical operators coerce data.tables/.frames to matrix. That means copies and is inefficient. Better use `Reduce`. – Roland Jun 18 '20 at 14:47
  • Thank you, Jaap. This answers my question and I will mark it as such. Can I please follow up with an extension question? Could we extend your solution to cover the case where the columns to be added were different from the columns where we check the condition? For instance, add columns colA1,colA2,...,colA10 if the corresponding values in colB1,colB2,...colB10 satisfy a condition? – lonestar79 Jun 18 '20 at 15:01
  • @Roland I didn't know that. But I don't see an easy solution using `Reduce` that also takes the condition into account. Going long form with `melt` as Henrik suggested might be a better solution I think. – Jaap Jun 18 '20 at 15:06
  • @lonestar79 you might use something like `dat[, sumCol := rowSums(.SD * dat[, .SD != 1, .SDcols = colB1:colB10], na.rm = TRUE), .SDcols = colA1:colA10]` – Jaap Jun 18 '20 at 15:10