13

Consider the following data frame:

   x y z
 1 0 0 0
 2 1 0 0
 3 0 1 0
 4 1 1 0
 5 0 0 1
 6 1 0 1
 7 0 1 1
 8 1 1 1
 -------
 x 4 2 1  <--- vector to multiply by 
 

I would like to multiply each column by a separate value, for example c(4, 2, 1), giving:

   x y z
 1 0 0 0
 2 4 0 0
 3 0 2 0
 4 4 2 0
 5 0 0 1
 6 4 0 1
 7 0 2 1
 8 4 2 1

Code:

pw2 <- c(4, 2, 1)
s01  <- seq_len(2) - 1
df  <- expand.grid(x=s01, y=s01, z=s01)
df

for (d in seq_len(3)) df[,d] <- df[,d] * pw2[d]
df

Question: Find a vectorized solution without a for loop (in base R).

Note that the question "Multiply columns in a data frame by a vector" is ambiguous because it covers two different tasks:

  • multiply each row of the data frame by a different value.
  • multiply each column of the data frame by a different value.

Both tasks can easily be solved with a for loop. Here a vectorised solution for the second task is explicitly requested.
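
To make the distinction concrete, here is a minimal sketch of the two readings on the example data. The row multipliers row_w are hypothetical and only serve as an illustration; sweep() is just one possible way to do the column case.

```r
# Rebuild the example data frame from the question.
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

# (a) One multiplier per ROW: a vector of length nrow(df) is recycled
#     down each column, so plain `*` already does the job.
row_w  <- seq_len(nrow(df))      # hypothetical row multipliers 1..8
by_row <- df * row_w

# (b) One multiplier per COLUMN: plain `*` is not enough, because the
#     length-3 vector is also recycled down the rows of each column.
naive  <- df * pw2               # runs without error, but is NOT the desired result
by_col <- sweep(df, 2, pw2, `*`) # multiplies column j by pw2[j]
```

The naive version fails silently here because 3 divides the number of cells, so R recycles pw2 without complaint.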

clp
    Probably the best of many answers across multiple questions about this exact problem: https://stackoverflow.com/a/65327572/9463489 – jblood94 Apr 12 '23 at 11:42

6 Answers

11

Use sweep() to apply a function along a margin of a data frame (here margin 2, the columns):

sweep(df, 2, pw2, `*`)

or with col:

df * pw2[col(df)]

output

  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
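
As a rough illustration of why the col() variant works, the sketch below spells out the intermediate steps. The names idx, w and res are introduced here for illustration only.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

# col() returns, for every cell, the index of the column it sits in:
# an 8 x 3 integer matrix whose first column is all 1s, second all 2s, third all 3s.
idx <- col(df)

# Indexing pw2 with that matrix gives one multiplier per cell, in column-major
# order: eight 4s, then eight 2s, then eight 1s.
w <- pw2[idx]

# Element-wise multiplication now pairs each cell with its column's multiplier.
res <- df * w

# sweep() reaches the same values by expanding pw2 along margin 2.
same <- all(res == sweep(df, 2, pw2, `*`))
```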

For large data frames, consider collapse::TRA (from the collapse package, so not base R), which is roughly 10x faster than the other answers in the benchmark below:

collapse::TRA(df, pw2, "*")

Benchmark:

bench::mark(
  sweep  = sweep(df, 2, pw2, `*`),
  col    = df * pw2[col(df)],
  "%*%"  = setNames(
    as.data.frame(as.matrix(df) %*% diag(pw2)),
    names(df)
  ),
  TRA    = collapse::TRA(df, pw2, "*"),
  mapply = data.frame(mapply(FUN = `*`, df, pw2)),
  apply  = t(apply(df, 1, \(x) x * pw2)),
  t      = t(t(df) * pw2),
  check  = FALSE  # results differ in class (data frame vs. matrix), so skip the equality check
)

# A tibble: 7 × 13
  expression      min  median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴
  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
1 sweep       346.7µs 382.1µs   2427.   1.23KB   10.6   1141     5 470.2ms
2 col         303.1µs 330.4µs   2760.     784B    8.45  1307     4 473.5ms
3 %*%          72.8µs  77.9µs  11861.     480B   10.6   5599     5 472.1ms
4 TRA             5µs   5.5µs 167050.       0B   16.7   9999     1  59.9ms
5 mapply      117.6µs 127.9µs   7309.     480B   10.6   3442     5 470.9ms
6 apply       107.8µs 117.9µs   7887.   6.49KB   12.9   3658     6 463.8ms
7 t            55.3µs  59.7µs  15238.     720B    8.13  5620     3 368.8ms
Maël
9

Convert df to a matrix and pw2 to a diagonal matrix with diag(), multiply them with the %*% matrix multiplication operator, then convert the result back to a data frame. This strips the column names, so wrap the call in setNames() to restore them.

setNames(
  as.data.frame(as.matrix(df) %*% diag(pw2)), 
  names(df)
)
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
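
The reasoning behind the diagonal matrix can be sketched as follows. The names m, D, p and D2 are illustrative, and naming the columns of the diagonal matrix is only a hypothetical alternative to setNames(), not part of the answer above.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

m <- as.matrix(df)   # 8 x 3 numeric matrix with column names x, y, z
D <- diag(pw2)       # 3 x 3 matrix with pw2 on the diagonal, zeros elsewhere

# (m %*% D)[i, j] = sum over k of m[i, k] * D[k, j] = m[i, j] * pw2[j],
# because D[k, j] is zero unless k == j.
p <- m %*% D

# The product takes its column names from D, which has none,
# so the names x, y, z are lost at this point.
lost <- is.null(colnames(p))

# Hypothetical alternative: name the columns of the diagonal matrix up front.
D2 <- diag(pw2)
colnames(D2) <- names(df)
res <- as.data.frame(m %*% D2)
```

Note that diag() treats a length-one numeric argument as a matrix size rather than a diagonal, so this approach needs extra care for a data frame with a single column.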
zephryl
6

Using mapply(), which pairs each column of df with the corresponding element of pw2 and returns a matrix:

mapply(FUN = `*`, df, pw2)

     x y z
[1,] 0 0 0
[2,] 4 0 0
[3,] 0 2 0
[4,] 4 2 0
[5,] 0 0 1
[6,] 4 0 1
[7,] 0 2 1
[8,] 4 2 1

and as a data frame:

data.frame(mapply(FUN = `*`, df, pw2))
  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
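
A small sketch of what mapply() does here, assuming the same df and pw2 as in the question. The SIMPLIFY = FALSE variant is shown only to make the intermediate list visible.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

# mapply() walks over the columns of df and the elements of pw2 in parallel:
# call 1 is df$x * pw2[1], call 2 is df$y * pw2[2], call 3 is df$z * pw2[3].
res_list <- mapply(FUN = `*`, df, pw2, SIMPLIFY = FALSE)  # named list of 3 vectors

# With the default SIMPLIFY = TRUE the equal-length results are bound into a matrix.
res_mat <- mapply(FUN = `*`, df, pw2)

# Either form can be turned back into a data frame.
res_df <- as.data.frame(res_list)
```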
S-SHAAF
6

Another option is to apply the multiplication to each row with apply() and transpose the result (note that this returns a matrix, not a data frame):

pw2 <- c(4, 2, 1)
t(apply(df, 1, \(x) x*pw2))
#>   x y z
#> 1 0 0 0
#> 2 4 0 0
#> 3 0 2 0
#> 4 4 2 0
#> 5 0 0 1
#> 6 4 0 1
#> 7 0 2 1
#> 8 4 2 1

Created on 2023-04-10 with reprex v2.0.2
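
Why the transpose is needed can be sketched as below. The variable names are illustrative, and the double-transpose line is the variant that appears in the benchmarks elsewhere in this thread rather than part of this answer.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)
m   <- as.matrix(df)

# apply(m, 1, f) calls f once per row, but each result becomes a COLUMN
# of the output, so the raw result is 3 x 8 rather than 8 x 3.
by_row <- apply(m, 1, function(r) r * pw2)
d <- dim(by_row)

# Transposing restores the original orientation.
res1 <- t(by_row)

# Double transpose: t(m) is 3 x 8, so pw2 is recycled down each column of t(m),
# which corresponds to one row of the original data. No R-level loop is needed.
res2 <- t(t(m) * pw2)
```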

Quinten
5

Here is another option: turn the vector into a matrix with the same dimensions as your data frame, then simply multiply the two element-wise:

t(replicate(nrow(df), pw2)) * df

Output

  x y z
1 0 0 0
2 4 0 0
3 0 2 0
4 4 2 0
5 0 0 1
6 4 0 1
7 0 2 1
8 4 2 1
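
For illustration, the weight matrix built by replicate() can also be constructed in other ways. The sketch below shows two possible alternatives (matrix() with byrow = TRUE and rep() with each =); these are suggestions under the same assumptions about df and pw2, not part of the answer above.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

# replicate() stacks nrow(df) copies of pw2 as columns (3 x 8);
# t() turns them into rows, giving an 8 x 3 matrix where every row is pw2.
w1 <- t(replicate(nrow(df), pw2))

# The same matrix can be filled row by row directly.
w2 <- matrix(pw2, nrow = nrow(df), ncol = length(pw2), byrow = TRUE)

# Or skip the matrix: repeat each multiplier nrow(df) times so that it
# lines up with the column-major layout of the data frame.
res <- df * rep(pw2, each = nrow(df))
```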
LMc
5

The existing mapply approach looks great, but I believe we can be more efficient by using Map + list2DF instead (especially if you prefer to stay with base R).
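
On its own, the Map + list2DF call might look like the sketch below, assuming df and pw2 as defined in the question. Note that list2DF() is only available from R 4.0.0 onwards.

```r
s01 <- seq_len(2) - 1
df  <- expand.grid(x = s01, y = s01, z = s01)
pw2 <- c(4, 2, 1)

# Map() is mapply() without simplification: it returns a named list with
# one element per column, each column multiplied by its own entry of pw2.
cols <- Map(`*`, df, pw2)

# list2DF() turns that list of equal-length vectors back into a data frame.
res <- list2DF(cols)
```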


Below is a benchmark for mapply and Map variants

microbenchmark::microbenchmark(  # requires the microbenchmark package
  "mapply1" = data.frame(mapply(FUN = `*`, df, pw2)),
  "mapply2" = as.data.frame(mapply(FUN = `*`, df, pw2)),
  "Map1" = list2DF(Map(`*`, df, pw2)),
  "Map2" = list2DF(Map(`*`, df, as.list(pw2)))
)

gives

Unit: microseconds
    expr  min    lq    mean median     uq   max neval
 mapply1 74.6 78.60 112.163  97.05 140.50 342.6   100
 mapply2 34.6 38.20  55.513  42.70  67.40 313.5   100
    Map1 23.8 25.25  33.728  27.60  41.30 113.8   100
    Map2 25.9 28.75  40.866  32.95  47.65 238.6   100

Also, let the Map approach join the benchmark provided by @Maël:

bc <- bench::mark(
  sweep = sweep(df, 2, pw2, `*`),
  col = df * pw2[col(df)],
  "%*%" = setNames(
    as.data.frame(as.matrix(df) %*% diag(pw2)),
    names(df)
  ),
  TRA = collapse::TRA(df, pw2, "*"),
  mapply1 = data.frame(mapply(FUN = `*`, df, pw2)),
  mapply2 = as.data.frame(mapply(FUN = `*`, df, pw2)),
  Map1 = list2DF(Map(`*`, df, pw2)),
  Map2 = list2DF(Map(`*`, df, as.list(pw2))),
  apply = t(apply(df, 1, \(x) x * pw2)),
  t = t(t(df) * pw2),
  check = FALSE,
)

We can see that Map takes second place in terms of efficiency, behind collapse::TRA:

# A tibble: 10 × 13
   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
 1 sweep       201.7µs  249.2µs     3526.  101.24KB     12.6  1680     6
 2 col         174.9µs  225.6µs     3637.    9.02KB     10.4  1748     5
 3 %*%          45.4µs   52.9µs    17026.   36.95KB     12.5  8158     6
 4 TRA           3.4µs    3.8µs   226089.  905.09KB     22.6  9999     1
 5 mapply1      71.6µs   78.4µs    11958.      480B     14.7  5681     7
 6 mapply2      33.1µs   37.4µs    25339.      480B     17.7  9993     7
 7 Map1         22.5µs   26.1µs    35649.        0B     17.8  9995     5
 8 Map2         25.3µs   29.4µs    31785.        0B     19.1  9994     6
 9 apply        70.2µs   80.7µs    11684.   11.91KB     14.7  5562     7
10 t            34.8µs   40.2µs    23608.    3.77KB     14.2  9994     6
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>

and autoplot(bc) plots these timings:

[Figure: autoplot(bc) output for the benchmark above]

ThomasIsCoding