
I have a data table dt, for example:

     a  b  c
[1]  1  2  3
[2]  2  3  4
[3]  3  4  5
[4]  4  5  6

I want to multiply every row of my dt element-wise by the values of the vector vec:

vec

1  0  0

I expect the following result for the output dt:

     a  b  c
[1]  1  0  0
[2]  2  0  0
[3]  3  0  0
[4]  4  0  0

I have solved this problem with a for loop. Is there a better (vectorized) and faster way? I sometimes have data tables with thousands of columns, which is why the loop gets very slow. I would also like to keep the data table format and avoid converting, if possible. In the end, however, the fastest runtime is what matters most to me.

gdol
  • Already answered here: https://stackoverflow.com/questions/3643555/multiply-rows-of-matrix-by-vector – sebpardo Mar 21 '19 at 16:44
  • That is a matrix solution. Doesn't it require converting the data.table to a matrix? Is it valid for a data table as well? – gdol Mar 21 '19 at 16:50
  • Yes, `sweep` seems to work with data.frames in addition to matrices. There's no need to convert – divibisan Mar 21 '19 at 16:52
  • @sebpardo When you see that a question has been answered before, you should flag it as duplicate – divibisan Mar 21 '19 at 16:53
  • Since this is a `data.table`, not a `matrix`, I would not close as duplicate. It's not clear (or mentioned at the proposed dupe) that `sweep` works on data tables without conversion, nor whether more efficient `data.table` answers are possible. I'd much rather OP provide simulation code for a *large* `data.table` representative of the actual use case and suitable for benchmarking. – Gregor Thomas Mar 21 '19 at 16:59
  • I'd also encourage OP to share their `for` loop. I suspect a for loop with `set` would be quite fast... I'd be willing to bet it will be faster than `sweep` on a large `data.table`, with a `data.table` as the output. – Gregor Thomas Mar 21 '19 at 17:00
  • It looks like both `sweep` and `t(t(df) * vec)` work on data.table, but the output is a `data.frame` or `matrix` respectively. I don't know if it takes any time to convert back from data.frame to data.table – divibisan Mar 21 '19 at 17:06
  • @divibisan `setDT()` "converts" a dataframe to a datatable in no time – DanY Mar 21 '19 at 17:11
  • Yes, and `sweep` and `t` convert the `data.table` to a `matrix`, a relatively expensive operation on a large table (OP mentions "thousands of columns"). Which is why I think a for loop with `set` will be quite a bit faster on large input. Whether `NA` values are present will also impact timing. – Gregor Thomas Mar 21 '19 at 17:17
  • Yeah, if you want to modify the table in place, `set` should work, and if not, then ``setDT(Map(`*`, dt, as.list(vec)))``. It would help to see code in the question for a reproducible example that's large enough in the dimensions you're concerned about (many columns? many rows?) – Frank Mar 21 '19 at 17:19
  • I don't know if it would help in this particular operation, but I would consider working with matrix in general for such large data. I've found that operations on matrices are generally faster. Internally, a matrix is kept as a single vector, whereas a data.frame/data.table is a series of vectors. – thc Mar 21 '19 at 17:31
  • Yeah. Sorta depends... if OP wants this done on the whole thing, that's a good indicator they should probably be using a `matrix`. If there are other columns, especially non-numeric columns, and previous/future steps will be grouped operations, then keeping it `data.table` seems better. – Gregor Thomas Mar 21 '19 at 17:35
  • I'd still like OP to post sample input before answering, but on a 1000x5000 data.table `x` I'm seeing `for (col in 1:ncol(x)) set(x, j = col, value = x[[col]] * vec[col])` as about 10x faster than `setDT(x <- sweep(x, MARGIN = 2, vec, "*"))`. – Gregor Thomas Mar 21 '19 at 17:41
  • @Gregor If you've already done the comparisons, the benefit to SO of not requiring gdol to post a [MCVE] would be greater than possibly not posting if she fails to clear that hurdle. I realize you would be possibly not getting the teaching point about proper SO behavior across to a newbie, but I also think you should think about future querents. – IRTFM Mar 21 '19 at 17:59
  • Yeah, my hesitancy was just to make sure *all* answers can use the same data, but it seems doubtful other people are working on answers currently so you're right I should just post. Then anyone can use my data. – Gregor Thomas Mar 21 '19 at 18:01

1 Answer


On a relatively large 5000x5000 data table, a for loop over columns using set is the fastest method I could find. Here are the other methods I tried, taken from the linked question, Multiply rows of matrix by vector. Methods are sorted from fastest to slowest, though the last two are nearly indistinguishable at this scale.

## sample data
library(data.table)
nr = 5000
nc = 5000
set.seed(47)
raw_matrix = matrix(rpois(nr * nc, lambda = 10), nrow = nr)
vec = rpois(nc, lambda = 2)


## For loop with set
# reset the data table
x = as.data.table(raw_matrix)
t0 = Sys.time()
for (col in 1:ncol(x)) set(x, j = col, value = x[[col]] * vec[col])
(set_time = Sys.time() - t0)
# Time difference of 0.151 secs


## Transpose and multiply
# reset the data table
x = as.data.table(raw_matrix)
t0 = Sys.time()
x <- as.data.table(t(t(x) * vec)) 
# using as.data.table because setDT does not work on matrix
(transpose_time = Sys.time() - t0)
# Time difference of 0.614 secs


## Sweep
# reset the data table
x = as.data.table(raw_matrix)
t0 = Sys.time()
setDT(x <- sweep(x, MARGIN = 2, vec, "*"))
(sweep_time = Sys.time() - t0)
# Time difference of 1.81 secs


## Make Matrix method
# reset the data table
x = as.data.table(raw_matrix)
t0 = Sys.time()
setDT(x <- x * matrix(vec, dim(x)[1], length(vec), byrow = TRUE))
(make_matrix_time = Sys.time() - t0)
# Time difference of 1.88 secs
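
As a sanity check, the four methods above can be compared on a tiny input to confirm they produce the same values. This is only a minimal sketch: the 4x3 input and the names `small`, `v`, `reference`, and `same_values` are made up for illustration and are not part of the benchmark, and `as.data.table()` is used for every conversion to keep the comparison uniform.

```r
library(data.table)

# Small hypothetical input, so the results can be compared directly
set.seed(47)
small <- matrix(rpois(4 * 3, lambda = 10), nrow = 4)
v <- c(1L, 0L, 2L)

# Reference result in base R: column j of the matrix times v[j]
reference <- small * rep(v, each = nrow(small))

# 1. for loop with set (modifies x_set by reference)
x_set <- as.data.table(small)
for (col in seq_along(x_set)) set(x_set, j = col, value = x_set[[col]] * v[col])

# 2. transpose and multiply
x_t <- as.data.table(t(t(as.data.table(small)) * v))

# 3. sweep
x_sweep <- as.data.table(sweep(as.data.table(small), MARGIN = 2, v, "*"))

# 4. make matrix
x_mm <- as.data.table(
  as.data.table(small) * matrix(v, nrow(small), length(v), byrow = TRUE)
)

# Compare values only, ignoring class and dimnames
same_values <- function(a) {
  isTRUE(all.equal(as.matrix(a), reference, check.attributes = FALSE))
}
results <- sapply(
  list(set = x_set, transpose = x_t, sweep = x_sweep, make_matrix = x_mm),
  same_values
)
print(results)
```

If all four entries are `TRUE`, the methods differ only in speed and in the class of the object they return, not in the values they compute.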

The set method works only if you want to modify the original data table in place. If you instead want to keep the original and make a modified copy, Frank's suggested method works well. It is even slightly faster than modifying the original, though it will, of course, require more memory:

##  Create modified copy
z <- setDT(Map(`*`, x, vec))
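
To make the difference concrete, here is a small sketch using the 4x3 example from the question. It assumes the columns and `vec` are integers, which the question does not state. `Map` returns a new table and leaves `dt` untouched, while the `set` loop overwrites `dt` by reference.

```r
library(data.table)

# Example data from the question (integer types are an assumption)
dt <- data.table(a = 1:4, b = 2:5, c = 3:6)
vec <- c(1L, 0L, 0L)

# Modified copy: dt itself is left untouched
z <- setDT(Map(`*`, dt, vec))
dt_unchanged <- identical(dt$b, 2:5)

# In-place update: dt is overwritten by reference
for (col in seq_along(dt)) set(dt, j = col, value = dt[[col]] * vec[col])

print(z)
print(dt)
```

Both `z` and the updated `dt` should keep column `a` as is and have zeros in `b` and `c`, matching the output expected in the question.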
Gregor Thomas
  • Fwiw, I get a slightly lower time for ``system.time(z<- setDT(Map(`*`, x, vec)))`` vs `system.time(for (col in 1:ncol(x)) set(x, j = col, value = x[[col]] * vec[col]))`, which also has the benefit (?) of not overwriting x. – Frank Mar 21 '19 at 19:37
  • Yeah, I'll add that in. I think the choice between the loop with `set` vs `setDT` on `Map` should be made 100% based on whether OP wants to modify the data in place or make a new modified object. – Gregor Thomas Mar 21 '19 at 19:46
  • Thank you for your input and an answer! – gdol Mar 22 '19 at 09:40