28

I have a data.table like this

DT <- ata.table::data.table(
  ref = rep(3L, 4L),
  nb = 12:15,
  i1 = c(3.1e-05, 0.044495, 0.82244, 0.322291),
  i2 = c(0.000183, 0.155732, 0.873416, 0.648545),
  i3 = c(0.000824, 0.533939, 0.838542, 0.990648),
  i4 = c(0.044495, 0.82244, 0.322291, 0.393595)
)
DT
#    ref nb       i1       i2       i3       i4
# 1:   3 12 0.000031 0.000183 0.000824 0.044495
# 2:   3 13 0.044495 0.155732 0.533939 0.822440
# 3:   3 14 0.822440 0.873416 0.838542 0.322291
# 4:   3 15 0.322291 0.648545 0.990648 0.393595

Now I want to calculate rows sums, but only including columns which start with an "i" ("i1", "i2", etc)

I have used grep to create a vector of the column names to be summed:

listCol <- colnames(DT)[grep("i", colnames(DT))]
listCol
# [1] "i1" "i2" "i3" "i4"

Then I have tried to loop over columns:

DT$sum <- rep.int(0, nrow(DT))
for (i in listCol){
    DT$sum = DT$sum + DT[ , get(i)]
}

...which gives the desired output:

DT
#    ref nb       i1       i2       i3       i4      sum
# 1:   3 12 0.000031 0.000183 0.000824 0.044495 0.045533
# 2:   3 13 0.044495 0.155732 0.533939 0.822440 1.556606
# 3:   3 14 0.822440 0.873416 0.838542 0.322291 2.856689
# 4:   3 15 0.322291 0.648545 0.990648 0.393595 2.355079

How can I improve my code?


Sub question:

This sub-question includes partially the answer to the previous one :

How to avoid this kind of strange notation :

myrowMeans = function (x){
    rowMeans(x, na.rm = TRUE)
}
DT[ , var := myrowMeans(.SD-myrowMeans(.SD)^2), .SDcols = grep("i", colnames(DT))]
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
user235852
  • 295
  • 1
  • 3
  • 7

3 Answers3

45

Use .SDcols to specify the columns, then take rowSums. Use := to assign new columns:

DT[ ,sum := rowSums(.SD), .SDcols = grep("i", names(DT))]
Hugh
  • 15,521
  • 12
  • 57
  • 100
40

You may also try with Reduce

 DT[, Sum := Reduce(`+`, .SD), .SDcols=listCol][]
 #   ref nb       i1       i2       i3       i4      Sum
 #1:   3 12 0.000031 0.000183 0.000824 0.044495 0.045533
 #2:   3 13 0.044495 0.155732 0.533939 0.822440 1.556606
 #3:   3 14 0.822440 0.873416 0.838542 0.322291 2.856689
 #4:   3 15 0.322291 0.648545 0.990648 0.393595 2.355079

NOTE: If there are "NA" values, it should be replaced with '0' before Reduce i.e.

 DT[, Sum := Reduce(`+`, lapply(.SD, function(x) replace(x, 
                    which(is.na(x)), 0))), .SDcols=listCol][]

**Another solution :**using rowSums

 DT[, Sum := rowSums(.SD, na.rm = TRUE), .SDcols = grep("i", names(DT))] 
Peter Chen
  • 1,464
  • 3
  • 21
  • 48
akrun
  • 874,273
  • 37
  • 540
  • 662
  • D'oh. I'm too slow; just said the same thing. @user235852 The `rowSums` function coerces to matrix before operating, so I suspect this is generally faster. – Frank Mar 26 '15 at 03:15
  • 4
    @Frank Not sure though. I had seen data.table experts using `rowSums`. One advantage with `rowSums` is the use of `na.rm=TRUE` in case there are NAs. With Reduce, we have to replace NA with `0` before proceeding with `+`. – akrun Mar 26 '15 at 03:17
  • 1
    I can t express my pleasure to see your answers ( I can t because I need reputation to vote up ) The best way to replace NA by 0 is describe here : http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table – user235852 Mar 27 '15 at 19:29
  • @user235852 Thanks for the link. Yes, I would also use `set` within the `for` loop for multiple columns. Here, it may be take a few more lines, so if your dataset is not very big, this would work fine. – akrun Mar 28 '15 at 03:34
  • 2
    @akrun maybe like this? `DT[, Sum := rowSums(.SD, na.rm = T), .SDcols = grep("i", names(DT))]` – Peter Chen Jun 14 '17 at 02:33
  • I have a small question about this. If I want to sum column (i1:i4) and create TOTAL under `ref`, how can I do? I post here: https://stackoverflow.com/questions/44517786/create-new-row-and-column-by-summing-multiple-column-and-row – Peter Chen Jun 14 '17 at 07:58
  • @PeterChen Just specify the `.SDcols` – akrun Jun 14 '17 at 08:14
  • but how to create a word "TOTAL" which shows similar as the link I post in comment above? – Peter Chen Jun 14 '17 at 08:17
  • I am to slow to grasp this solution and it looks very neat. Do you care to explain in details how it works? For instance, what is `.SDcols`? what is this operator `:=`. It looks I would need to read some more to fully understand this piece of code. – Faustin Gashakamba Sep 24 '22 at 09:02
1

The dplyr solution would be to use mutate_ together with paste(listCol, collapse = "+"). But I guess the Reduce solution is faster.

DT <- mutate_(DT, sum = paste(listCol, collapse = "+"))
shadow
  • 21,823
  • 4
  • 63
  • 77