27

I'm trying to multiply a data frame df by a vector v, so that the product is a data frame, where the i-th row is given by df[i,]*v. I can do this, for example, by

df <- data.frame(A=1:5, B=2:6); v <- c(0,2)
as.data.frame(t(t(df) * v))
   A  B
1  0  4
2  0  6
3  0  8
4  0 10
5  0 12

I am sure there has to be a more R-style approach (and a very simple one!), but nothing comes to mind. I even tried something like

apply(df, MARGIN=1, function(x) x*v)

but hard-to-read constructions like as.data.frame(t(.)) are still required.
How can I find an efficient and elegant workaround here?

tonytonov
  • Why does it need to be a data.frame? If you have all numeric elements it generally makes more sense to use a matrix. – Señor O Aug 22 '13 at 16:42

6 Answers

38

This works too:

data.frame(mapply(`*`,df,v))

In that solution, you are taking advantage of the fact that a data.frame is a type of list, so you can iterate over the elements of df and v at the same time with mapply.

Unfortunately, you are limited in what you can output from mapply: a simple list, or a matrix. If your data are huge, this would likely be more efficient:

data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE))

With SIMPLIFY=FALSE, mapply returns a list instead of a matrix, and a list is cheaper to convert to a data.frame.
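
Equivalently, base R's Map is mapply with SIMPLIFY=FALSE built in, so the same list-returning call can be written a little more compactly (a small variant, not from the original answer):

data.frame(Map(`*`, df, v))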

nograpes
  • This is a great line of code, and it seems to be the most efficient as well. Not quite self-explanatory, but very neat compared to my solution. +1 for further optimization! – tonytonov Aug 22 '13 at 14:43
  • @Arun I thought you were right, but eddi's answer seems to show that it is much slower. Perhaps the matrix generation takes longer than you think? – nograpes Aug 22 '13 at 16:43
12

If you're looking for speed and memory efficiency - data.table to the rescue:

library(data.table)
dt = data.table(df)

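# multiply each column of dt by the corresponding element of v, updating dt in place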
for (i in seq_along(dt))
  dt[, (i) := dt[[i]] * v[i]]


library(microbenchmark)

eddi = function(dt) { for (i in seq_along(dt)) dt[, (i) := dt[[i]] * v[i]] }
arun = function(df) { df * matrix(v, ncol=ncol(df), nrow=nrow(df), byrow=TRUE) }
nograpes = function(df) { data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)) }

N = 1e6
dt = data.table(A = rnorm(N), B = rnorm(N))
v = c(0,2)

microbenchmark(eddi(copy(dt)), arun(copy(dt)), nograpes(copy(dt)), times = 10)
#Unit: milliseconds
#               expr       min        lq      mean    median        uq       max neval
#     eddi(copy(dt))  23.01106  24.31192  26.47132  24.50675  28.87794  34.28403    10
#     arun(copy(dt)) 337.79885 363.72081 450.93933 433.21176 516.56839 644.70103    10
# nograpes(copy(dt))  19.44873  24.30791  36.53445  26.00760  38.09078  95.41124    10

As Arun points out in the comments, one can also use the set function from the data.table package to do this in-place modification on plain data.frames as well:

for (i in seq_along(df))
  set(df, j = i, value = df[[i]] * v[i])

This of course also works for data.tables and can be significantly faster if the number of columns is large.
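
For instance, on the question's data (a quick sketch; note that set modifies df by reference, so take a copy first if you still need the original values):

library(data.table)
df <- data.frame(A=1:5, B=2:6); v <- c(0,2)
for (i in seq_along(df))
  set(df, j = i, value = df[[i]] * v[i])
df
#   A  B
# 1 0  4
# 2 0  6
# 3 0  8
# 4 0 10
# 5 0 12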

eddi
  • +1 nice! The documentation states that using `set` in a `for` loop would be faster because there's no overhead from `[.data.table`. However, here, I don't see it being faster... any idea? Also, `set` can be used with a `data.frame`. You don't have to convert to a `data.table` (and assignment happens by reference)! – Arun Aug 22 '13 at 18:53
  • good point about set, but since, I assume, the number of columns is small, I don't think for loop vs set is going to make a difference (if the number of columns is large enough for it to matter, I think `data.table` is not a good data structure any more at that point); also in my world there are no conversions to `data.table` as everything is `data.table` to begin with ;) – eddi Aug 22 '13 at 19:08
  • yes. What I meant (about *any idea*) was, `set` is *slower*... I can't explain why it's slower... – Arun Aug 22 '13 at 19:18
  • @Arun got it; you're right - I just checked and `set` is indeed slower for 2 columns - *but* becomes better as the number of columns increases - it's actually 2x as fast for 10 columns – eddi Aug 22 '13 at 19:21
  • +1 Btw, we're starting to favour `dt[, (i) := dt[[i]] * v[i]]` so we don't need `with=FALSE`. The `with=FALSE` may be confusing as to what the `with` refers to. `(i)` is still confusing maybe, but at least the reader knows it's something to do with `i`. – Matt Dowle Sep 02 '13 at 19:39
  • Might want to update this for two reasons. First, the data.table call is deprecated. Second, I tried with `classicR=function(df){ as.matrix(df) %*% v}` and got results only about half as fast as the data.table version--a lot better than arun and nograpes, both of which are also more complicated. – pdb May 31 '18 at 18:57
  • @PaulBailey updated to new syntax, but your `classicR` function doesn't work as is – eddi Jun 01 '18 at 16:33
8

A language that lets you combine vectors with matrices has to decide at some point whether its matrices are row-major or column-major ordered. The reason:

> df * v
  A  B
1 0  4
2 4  0
3 0  8
4 8  0
5 0 12

is because R operates down the columns first. Doing the double-transpose trick subverts this. Sorry if this is just explaining what you know, but I don't know another way of doing it, except explicitly expanding v into a matrix of the same size, as shown below.
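
For reference, that explicit expansion looks like this (the same construction that appears as arun in the benchmarks above):

df * matrix(v, ncol=ncol(df), nrow=nrow(df), byrow=TRUE)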

Or write a nice function that wraps the not very R-style code into something that is R-stylish.
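
A minimal sketch of such a wrapper (the function name rowwise_mult is made up for illustration):

rowwise_mult <- function(df, v) as.data.frame(t(t(df) * v))
rowwise_mult(df, v)  # same result as the double-transpose one-liner, behind a readable name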

Spacedman
  • The flexibility of R is what we love it for, that's so true. Thanks for the comment, I think the solution will be to wrap this into a function in order to preserve code readability. – tonytonov Aug 22 '13 at 14:36
4

What's wrong with

t(apply(df, 1, function(x) x*v))

?

Fernando
  • This returns a matrix instead of a data.frame, so it would be `data.frame(t(apply(df, 1, function(x) x*v)))`, which is less concise than @nograpes' answer `data.frame(mapply(`*`,df,v))`. – Rob Aug 22 '13 at 14:31
  • Thanks for pointing that out, Rob. The question of efficiency is still open though. – tonytonov Aug 22 '13 at 14:38
  • `mapply` indeed also appears to be faster: On a `data.frame` with 1000000 rows it takes 4.42 sec using `mapply` vs. 12.52 sec using `apply` on my system. – Rob Aug 22 '13 at 14:43
4

library(purrr)

map2_dfc(df, v, `*`)
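
On the question's data this gives the same values as the other answers. Note that the _dfc variants return a tibble rather than a plain data.frame, so wrap the result in as.data.frame() if that matters (a small usage sketch):

library(purrr)
df <- data.frame(A=1:5, B=2:6); v <- c(0,2)
as.data.frame(map2_dfc(df, v, `*`))
#   A  B
# 1 0  4
# 2 0  6
# 3 0  8
# 4 0 10
# 5 0 12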

Benchmark

library(data.table)

N = 1e6
dt = data.table(A = rnorm(N), B = rnorm(N))
v = c(0,2)

eddi = function(dt) { for (i in seq_along(dt)) dt[, (i) := dt[[i]] * v[i]]; dt }
arun = function(df) { df * matrix(v, ncol=ncol(df), nrow=nrow(df), byrow=TRUE) }
nograpes = function(df) { data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)) }
ryan = function(df) {map2_dfc(df, v, `*`) }
library(microbenchmark)
microbenchmark(
  eddi(copy(dt))
  , arun(copy(dt))
  , nograpes(copy(dt))
  , ryan(copy(dt))
  , times = 100)


# Unit: milliseconds
# expr                     min        lq      mean    median        uq      max neval
# eddi(copy(dt))      8.367513  11.06719  24.26205  12.29132  19.35958 171.6212   100
# arun(copy(dt))     94.031272 123.79999 186.42155 148.87042 251.56241 364.2193   100
# nograpes(copy(dt))  7.910739  10.92815  27.68485  13.06058  21.39931 172.0798   100
# ryan(copy(dt))      8.154395  11.02683  29.40024  13.73845  21.77236 181.0375   100
IceCreamToucan
1

I think the fastest way (without testing data.table) is data.frame(t(t(df)*v)).

My tests:

library(microbenchmark)

testit <- function(nrow, ncol)
{
    df <- as.data.frame(matrix(rnorm(nrow*ncol),nrow=nrow,ncol=ncol))

    v <- runif(ncol)

    r1 <- data.frame(t(t(df)*v))
    r2 <- data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE))
    r3 <- df * rep(v, each=nrow(df))

    stopifnot(identical(r1, r2) && identical(r1, r3))

    microbenchmark(data.frame(t(t(df)*v)), data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)), df * rep(v, each=nrow(df)))
}

Result

> set.seed(1)
> 
> testit(100,100)
Unit: milliseconds
                                             expr       min        lq    median        uq      max neval
                         data.frame(t(t(df) * v))  2.297075  2.359541  2.455778  3.804836 33.05806   100
 data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE))  9.977436 10.401576 10.658964 11.762009 15.09721   100
                     df * rep(v, each = nrow(df)) 14.309822 14.956705 16.092469 16.516609 45.13450   100
> testit(1000,10)
Unit: microseconds
                                             expr      min       lq   median       uq      max neval
                         data.frame(t(t(df) * v))  754.844  805.062  844.431 1850.363 27955.79   100
 data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE)) 1457.895 1497.088 1567.604 2550.090  4732.03   100
                     df * rep(v, each = nrow(df)) 5383.288 5527.817 5875.143 6628.586 32392.81   100
> testit(10,1000)
Unit: milliseconds
                                             expr       min        lq    median        uq       max neval
                         data.frame(t(t(df) * v))  17.07548  18.29418  19.91498  20.67944  57.62913   100
 data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE))  99.90103 104.36028 108.28147 114.82012 150.05907   100
                     df * rep(v, each = nrow(df)) 112.21719 118.74359 122.51308 128.82863 164.57431   100
Ferdinand.kraft
  • you're looking at tiny data (where those differences don't matter unless you're doing loops) - look at e.g. `testit(100000,10)` - not super large and shaped like data is usually shaped – eddi Aug 22 '13 at 17:28
  • @eddi, interesting. But transposing twice is still on the same order as mapply for 1e6 rows. Actually it's about 5% faster in my run. – Ferdinand.kraft Aug 22 '13 at 17:34