When I perform .SD operations using data.table, I often encounter situations in which it would be useful to access column name attributes within the lapply/.SD statement. Normally, situations like these arise when I need to perform a data.table operation which involves columns of an external data.table.

Say, for instance, that I have a data.table dt with two columns. In addition, I have a data.table mult which serves as a "multiplication matrix", i.e. it contains the factors by which I want to multiply the columns in dt.

library(data.table)

dt = data.table(Val1 = rep(1,5), Val2 = rep(2,5))
mult = data.table(Val1 = 5, Val2 = 10)

> dt
   Val1 Val2
1:    1    2
2:    1    2
3:    1    2
4:    1    2
5:    1    2

> mult
   Val1 Val2
1:    5   10

In this elementary example, I want to multiply Val1 and Val2 in dt by the respective multiplication factors in mult. In base R, this can be done with sapply:

mat = sapply(colnames(dt), function(x){
  dt[[x]] * mult[[x]]
})

> data.table(mat)
   Val1 Val2
1:    5   20
2:    5   20
3:    5   20
4:    5   20
5:    5   20

Now, this works because sapply is applied across the column names of dt, not the columns themselves.

Say that I would like to perform the same operation using data.table/.SD. The issue here is that I cannot find a way of accessing the 'current' column name within the lapply statement considering that we iterate over the entire subset, not the names. Hence, I cannot index and source the appropriate multiplication factor from the mult table from within the lapply statement.

The pseudocode of what I would like to do is below:

dt[, lapply(.SD, function(x){

 # name = name of the column currently being iterated over in .SD, i.e. first 'Val1' and then 'Val2'
 # return(x*mult[[name]])

}), .SDcols = c('Val1', 'Val2')]

I am aware that there are workarounds using explicit indexing in the lapply statement (e.g. lapply(1:ncol(dt), function(i) {...})), but I would like to understand whether this is feasible using .SD instead.

Thank you in advance.

JDG
  • lapply(names(.SD), function(name) get(name)*mult[[name]]) or .SD[[name]]*mult[[name]]? – chinsoon12 Oct 18 '19 at 13:00
  • Very elegant. I am, however, apprehensive about using get(x) in data.table expressions; see for instance https://stackoverflow.com/questions/48234064/getx-does-not-work-in-r-data-table-when-x-is-also-a-column-in-the-data-table – JDG Oct 18 '19 at 13:06
  • This solution does many of the things you want. You'd have to modify the solution a bit to meet your needs but it makes use of the `:=` and some tricky unlist/list capabilities. https://stackoverflow.com/a/57386591/8485287 – Adam Sampson Oct 18 '19 at 13:37
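
A minimal sketch of the `.SD[[name]]` variant suggested in the first comment, which avoids get() altogether. The extra Val3 column in mult and the helper variable cols are assumptions added for illustration; they are not part of the original example.

```r
library(data.table)

dt   <- data.table(Val1 = rep(1, 5), Val2 = rep(2, 5))
# Val3 is a hypothetical extra column, included to show that matching is by name
mult <- data.table(Val1 = 5, Val2 = 10, Val3 = 15)

cols <- c("Val1", "Val2")  # hypothetical helper holding the columns of interest

# Iterate over the names of .SD rather than its columns.
# setNames(nm = ...) names the input vector by itself, so the list
# returned by lapply (and hence the output columns) keeps those names.
res <- dt[, lapply(setNames(nm = names(.SD)),
                   function(nm) .SD[[nm]] * mult[[nm]]),
          .SDcols = cols]

print(res)
```

Because each factor is looked up as mult[[nm]], neither the order nor the number of columns in mult should matter, as long as every name in .SDcols exists in mult.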

2 Answers

You could use the Map function with mult being a vector:

mult <- c(5, 10)
dt[, Map("*", .SD, mult), .SDcols = c("Val1", "Val2")]

UPDATE

If mult needs to stay a data.table, for example because only a subset of its columns is used at a time, you can subset it by names(.SD). The columns listed in .SDcols are then multiplied by the matching columns in mult, regardless of the order or number of columns in mult.

dt = data.table(Val1 = rep(1,5), Val2 = rep(2,5), Val3 = rep(3, 5))
mult = data.table(Val1 = 5, Val2 = 10, Val3 = 15)

dt[, Map("*", .SD, mult[, names(.SD), with = FALSE]), .SDcols = c("Val1", "Val3")]
#    Val1 Val3
# 1:    5   45
# 2:    5   45
# 3:    5   45
# 4:    5   45
# 5:    5   45
Biblot
  • Appreciate this, however this solution is sensitive to the length and ordering of columns in `mult`. What if for instance `mult` had a third column Val3 which is to be used at a later stage? It is important that the name of x is directly matched to the appropriate column in `mult`, i.e. `mult[[name]]`. – JDG Oct 18 '19 at 13:12
  • @J.G. Please see my update to solve this problem. I simply subset `mult` with the columns declared in `.SDcols`. Thus, the solution isn't sensitive to ordering of columns in `mult`. However, if `mult` were to have multiple rows, I think recycling would occur. – Biblot Oct 18 '19 at 13:40
  • Thank you Samuel. I actually have not used `Map()` within this scope before, interesting. – JDG Oct 21 '19 at 12:15

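My answer is not based on the lapply function, but since list elements become columns in the data.table output, consider the following possibility. Note that Map pairs the columns by position, so this assumes dt and mult have the same columns in the same order.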

dt[, Map(`*`, .SD, mult)]

#    Val1  Val2
#1:     5    20
#2:     5    20
#3:     5    20
#4:     5    20
#5:     5    20
  • Thank you. Although useful for this example in particular, my core issue here is that I want to retrieve the column name of the current iteration in .SD – JDG Oct 18 '19 at 13:15
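
One possible way to expose the current column name while still iterating over .SD itself is to pass names(.SD) to Map as a second argument. This is only a sketch under the assumptions of the three-column example above; the argument names x and nm are arbitrary.

```r
library(data.table)

dt   <- data.table(Val1 = rep(1, 5), Val2 = rep(2, 5), Val3 = rep(3, 5))
mult <- data.table(Val1 = 5, Val2 = 10, Val3 = 15)

# Map walks over .SD and names(.SD) in parallel:
#   x  is the current column, nm is that column's name.
# The result takes its names from the first mapped argument (.SD),
# so the output columns keep their original names.
res2 <- dt[, Map(function(x, nm) x * mult[[nm]], .SD, names(.SD)),
           .SDcols = c("Val1", "Val3")]

print(res2)
```

Inside the function, nm plays the role of the `name` variable from the question's pseudocode, so the factor is matched by name rather than by position.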