3

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.

Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.

update: the proposed solution must (a) use the := operator to make the most of data.table fast performance, and (2) be flexible to other operation rather than mean and sum, which I used here just for the sake of simplicity

library(data.table)

# generate data
  dt <- data.table(id= letters[1:5],
                   v1= 1:5, 
                   v2= 1:5,
                   f1= 11:15,
                   f2= 11:15)

dt
#>    id v1 v2 f1 f2
#> 1:  a  1  1 11 11
#> 2:  b  2  2 12 12
#> 3:  c  3  3 13 13
#> 4:  d  4  4 14 14
#> 5:  e  5  5 15 15

# what I've tried
  dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109

2 Answers2

2

We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable

dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
         , sum(v) - mean(f), id], Y :=V1,  on = .(id)]

dt
#   id v1 v2 f1 f2  Y
#1:  a  1  1 11 11 -9
#2:  b  2  2 12 12 -8
#3:  c  3  3 13 13 -7
#4:  d  4  4 14 14 -6
#5:  e  5  5 15 15 -5

Or another option is with Reduce after creating index or 'v' and 'f' columns

nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmv)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with  = FALSE])/l1)]
akrun
  • 874,273
  • 37
  • 540
  • 662
0

rowSums and rowMeans combined with grep can accomplish this.

dt$Y <- rowMeans(dt[,grep("f", names(dt)),with=FALSE]) - rowSums(dt[,grep("v", names(dt)),with=FALSE])
pyll
  • 1,688
  • 1
  • 26
  • 44
  • Thanks for the answer. However, the solution (a) must use the `:=` operator to make the most of `data.table` fast performance, and (2) it must be flexible to other operation rather than `mean` and `sum`, which I used here just for the sake of simplicity – rafa.pereira Jun 15 '17 at 14:47
  • 1
    Ok that makes sense. Going to leave this answer up, though, since it may be helpful for somebody who does not need the additional functionality you added to the original question. – pyll Jun 15 '17 at 15:53
  • sure! Thanks again ! – rafa.pereira Jun 15 '17 at 15:54