Row operations on selected columns based on substring in data.table

Question

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.

Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.

update: the proposed solution must (a) use the := operator to make the most of data.table fast performance, and (2) be flexible to other operation rather than mean and sum, which I used here just for the sake of simplicity

library(data.table)

# generate data
  dt <- data.table(id= letters[1:5],
                   v1= 1:5, 
                   v2= 1:5,
                   f1= 11:15,
                   f2= 11:15)

dt
#>    id v1 v2 f1 f2
#> 1:  a  1  1 11 11
#> 2:  b  2  2 12 12
#> 3:  c  3  3 13 13
#> 4:  d  4  4 14 14
#> 5:  e  5  5 15 15

# what I've tried
  dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]

The desired output of `Y` is `-9, -8, -7, -6, -5` – rafa.pereira Jun 15 '17 at 14:51 — rafa.pereira, Jun 15 '17 at 14:51

akrun · Accepted Answer · 2017-06-15T15:10:52.220

We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable

dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
         , sum(v) - mean(f), id], Y :=V1,  on = .(id)]

dt
#   id v1 v2 f1 f2  Y
#1:  a  1  1 11 11 -9
#2:  b  2  2 12 12 -8
#3:  c  3  3 13 13 -7
#4:  d  4  4 14 14 -6
#5:  e  5  5 15 15 -5

Or another option is with Reduce after creating index or 'v' and 'f' columns

nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmv)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with  = FALSE])/l1)]

score 0 · Answer 2 · answered Jun 15 '17 at 14:41

0

rowSums and rowMeans combined with grep can accomplish this.

dt$Y <- rowMeans(dt[,grep("f", names(dt)),with=FALSE]) - rowSums(dt[,grep("v", names(dt)),with=FALSE])

answered Jun 15 '17 at 14:41

pyll

1,688
1
26
44

Thanks for the answer. However, the solution (a) must use the `:=` operator to make the most of `data.table` fast performance, and (2) it must be flexible to other operation rather than `mean` and `sum`, which I used here just for the sake of simplicity – rafa.pereira Jun 15 '17 at 14:47
1

Ok that makes sense. Going to leave this answer up, though, since it may be helpful for somebody who does not need the additional functionality you added to the original question. – pyll Jun 15 '17 at 15:53
sure! Thanks again ! – rafa.pereira Jun 15 '17 at 15:54

Row operations on selected columns based on substring in data.table

2 Answers2