
I have nc columns in a data.table, and nc scalars in a vector. I want to take a linear combination of the columns, but I don't know ahead of time which columns I will be using. What is the most efficient way to do this?

setup

require(data.table)
set.seed(1)

n  <- 1e5
nc <- 5
cf <- setNames(rnorm(nc),LETTERS[1:nc])
DT <- setnames(data.table(replicate(nc,rnorm(n))),LETTERS[1:nc])

ways to do it

Suppose I want to use the first four columns. I can manually write:

DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)]

I can think of two automatic ways (ones that work without hard-coding which columns are used):

mycols <- LETTERS[1:4] # the first four columns
DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols]
DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]

benchmarking

I expect the as.matrix coercion to make the first automatic option slow, and I really have no intuition for the speed of Map-Reduce combinations.

require(rbenchmark)
options(datatable.verbose=FALSE) # in case you have it turned on

benchmark(
    manual=DT[,list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
    coerce=DT[,list(as.matrix(.SD)%*%cf[mycols]),.SDcols=mycols],
    maprdc=DT[,list(Reduce(`+`,Map(`*`,cf[mycols],.SD))),.SDcols=mycols]
)[,1:6]

    test replications elapsed relative user.self sys.self
2 coerce          100    2.47    1.342      1.95     0.51
1 manual          100    1.84    1.000      1.53     0.31
3 maprdc          100    2.40    1.304      1.62     0.75

I get anywhere from a 5% to 40% slowdown relative to the manual approach when I repeat the benchmark call.

my application

The dimensions here (n and length(mycols)) are close to what I am working with, but I will be running this computation many times, with different coefficient vectors cf.
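
To make that concrete, here is a rough sketch of the calling pattern (the lincomb wrapper and the loop are placeholders for my actual optimization routine, not real code from it):

lincomb <- function(cf, cols) {
    # the Map-Reduce combination from above, wrapped for reuse;
    # data.table looks up cf and cols in the calling frame
    DT[, Reduce(`+`, Map(`*`, cf[cols], .SD)), .SDcols = cols]
}

for (i in 1:10) {
    cf_i <- setNames(rnorm(length(mycols)), mycols)  # new candidate coefficients
    res  <- lincomb(cf_i, mycols)
}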

Frank

2 Answers


This is almost 2x faster for me than your manual version:

Reduce("+", lapply(names(DT), function(x) DT[[x]] * cf[x]))

benchmark(manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
          reduce = Reduce('+', lapply(names(DT), function(x) DT[[x]] * cf[x])))
#    test replications elapsed relative user.self sys.self user.child sys.child
#1 manual          100    1.43    1.744      1.08     0.36         NA        NA
#2 reduce          100    0.82    1.000      0.58     0.24         NA        NA

And to iterate over just mycols, replace names(DT) with mycols in lapply.
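
That is:

Reduce("+", lapply(mycols, function(x) DT[[x]] * cf[x]))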

eddi
  • Thanks, that is quite an improvement! I'll wait to mark it as the answer in case you or someone else comes up with something even faster. – Frank Oct 09 '13 at 18:19

Add this option to your benchmark call:

ops = as.matrix(DT) %*% cf

On my device it was 30% faster than the matrix multiplication you tried.
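
For example, the extended call could look like this (the ops label is arbitrary; note that this variant combines all nc columns with the full cf, unlike the subsetted versions):

benchmark(
    manual = DT[, list(cf['A']*A+cf['B']*B+cf['C']*C+cf['D']*D)],
    coerce = DT[, list(as.matrix(.SD) %*% cf[mycols]), .SDcols = mycols],
    ops    = as.matrix(DT) %*% cf
)[, 1:6]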

IRTFM
  • Thanks, DWin. I have edited the question to clarify that my use-case involves a subset of columns. Perhaps I should also alter it to use `newcol:=...` to clarify that I want to have the column inside the data.table at the end. I did not find `DT[,list(as.matrix(DT[,mycols,with=FALSE]) %*% cf)]` faster. – Frank Oct 09 '13 at 18:15
  • @Frank just pad the vector with zeros for columns that you don't want to sum. Or if the number of columns is much greater than the subset of columns you are interested in, then subset the data table, convert to matrix, and call with the unpadded vector. In any case, my bet is the vectorized linear algebra approach here will be one of the fastest ways to do it (see the sketch after these comments). – Clayton Stanley Oct 10 '13 at 04:44
  • @ClaytonStanley That's a good, mathematically correct suggestion, but I guess it might be computationally costly if the number of columns in DT is far larger than the number of columns I want to use. – Frank Oct 10 '13 at 04:49
  • @ClaytonStanley Ah, I see. My hunch is that matrix operations are overkill here, since the output is not going to be a matrix; and that coding up a fast Rcpp `sweep` and then an `apply` with `sum` would be fastest. For now, I'm going to go with eddi's approach, but I'd be interested to hear of any benchmarks that suggest there's room for significant improvement. Thanks for giving it some thought. :) – Frank Oct 10 '13 at 04:56
  • @Frank Just noticed that you are also running this computation many times with different coefficient vectors. If that's the case, I might relook at DWin's linear algebra approach and multiply the coerced matrix (or a subset, if the columns don't change across coefficient vectors) by a coefficient matrix (also sketched below). – Clayton Stanley Oct 10 '13 at 14:53
  • @ClaytonStanley My optimization routine only uses one candidate coefficient vector at a time, but I'd do what you describe if I was using a fancier method. Thanks – Frank Oct 10 '13 at 14:57
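
For reference, a minimal sketch of the two ideas raised in these comments (cf_pad and CF are illustrative names, and the second candidate vector is a made-up stand-in):

# zero-pad the coefficient vector so unused columns contribute nothing
cf_pad <- setNames(numeric(ncol(DT)), names(DT))
cf_pad[mycols] <- cf[mycols]
as.matrix(DT) %*% cf_pad

# with several candidate vectors known up front, bind them into a matrix
# and do a single multiplication: one result column per coefficient vector
CF <- cbind(a = cf_pad, b = rev(cf_pad))  # rev() is only a stand-in second vector
as.matrix(DT) %*% CF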