3

One has a matrix Lambda with p columns and n rows, and for each row wants to set all values to 0 except the value in the first column and the maximum of the values in the other columns (in other words, zero out the p - 2 smallest values after excluding the first column).

For the moment I am doing this with a for loop, as follows:

set.seed(60)
(Lambda = matrix(sample.int(30),5))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   19   20   27   18   15   25
[2,]   16   28    1    4   22    7
[3,]    2   10    8   23    3   12
[4,]    5    6    9   17   11   29
[5,]   26   30   24   13   14   21

m <- ncol(Lambda) - 2
for(ir in seq_len(nrow(Lambda))){
    Lambda[ir, match(tail(sort(abs(Lambda[ir, 2:ncol(Lambda)]), decreasing = TRUE), m), Lambda[ir,])] <- 0
}
Lambda
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   19    0   27    0    0    0
[2,]   16   28    0    0    0    0
[3,]    2    0    0   23    0    0
[4,]    5    0    0    0    0   29
[5,]   26   30    0    0    0    0

Fine, this achieves the goal, but with many rows it would become a problem. Is there a solution that does not use a for loop? It could be with lapply, but I am not sure whether that would really be efficient. Maybe with data.table after converting the matrix?

Thank you!

markus
iago
  • How big is the actual matrix you are going to use this one? – s_baldur Oct 16 '20 at 17:01
  • Actually my matrix is not so big (any case, less than 1 million rows). It is not a problem for me but I would prefer to solve this question in a better way – iago Oct 16 '20 at 17:04
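For reference, the data.table route the question asks about could be sketched as follows (an untested sketch: it melts the matrix to long form, zeroes every non-maximum value per row outside column V1, then casts back; `res` is a made-up name):

```r
library(data.table)

set.seed(60)
Lambda <- matrix(sample.int(30), 5)

dt <- as.data.table(Lambda)        # columns V1..V6
dt[, row := .I]                    # remember each row's id
long <- melt(dt, id.vars = "row")  # one (row, variable, value) triple per cell
# outside V1, keep only each row's maximum (with ties, all tied maxima survive)
long[variable != "V1", value := value * (value == max(value)), by = row]
res <- as.matrix(dcast(long, row ~ variable, value.var = "value")[, -1])
```

The reshaping overhead likely makes this slower than the matrix-based answers below, so it is shown only because the question mentions it.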

3 Answers

3

So, concerning:

Is there a solution not using a for loop?

For some algorithms, you just have to write a for loop. And that's okay! Swapping a for loop for something like lapply is not really a performance improvement (see https://stackoverflow.com/a/42440872/4917834).

It is possible to speed up your code though:

# your example
set.seed(60)
Lambda = matrix(sample.int(30),5)

original <- function(Lambda) {
  m <- ncol(Lambda) - 2
  for (ir in seq_len(nrow(Lambda))){
    Lambda[ir, match(tail(sort(abs(Lambda[ir, 2:ncol(Lambda)]), decreasing = TRUE), m), Lambda[ir,])] <- 0
  }
  Lambda
}
original(Lambda)

# a faster alternative
proposal <- function(Lambda) {

  nc <- ncol(Lambda)
  for (i in seq_len(nrow(Lambda))) {
    m <- which.max(abs(Lambda[i, -1L]))
    Lambda[i, (2:nc)[-m]] <- 0
  }
  Lambda
}
proposal(Lambda)

Let's benchmark the two approaches:

bch <- bench::mark(
  original(Lambda),
  proposal(Lambda)
)
summary(bch, relative = TRUE)
# A tibble: 2 x 13
  expression              min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc 
  <bch:expr>            <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl> 
1 original(Lambda)       25.7   24.1       1           1     1     1447     4
2 proposal(Lambda)        1      1        23.6         1     2.57  9997     3

So proposal is about 24 times faster than your original solution (the median time for original is 313.8µs; for proposal it is 13.1µs). If that is not fast enough, it might be worthwhile to look for a package that has implemented this in C or C++. I played around with matrixStats but had no luck. Alternatively, you could port this to C++ with Rcpp, which should also speed up the code.
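For completeness, such an Rcpp port could look roughly like this (a sketch assuming Rcpp is installed; `zero_non_max` is a made-up name, and it mirrors the loop in proposal() including the abs()):

```r
Rcpp::cppFunction('
NumericMatrix zero_non_max(NumericMatrix Lambda) {
  int nr = Lambda.nrow(), nc = Lambda.ncol();
  NumericMatrix out = clone(Lambda);   // do not modify the input in place
  for (int i = 0; i < nr; ++i) {
    // find the column (among 2..nc) with the largest absolute value
    int best = 1;
    for (int j = 2; j < nc; ++j)
      if (std::fabs(out(i, j)) > std::fabs(out(i, best))) best = j;
    // zero everything except column 1 and the winner
    for (int j = 1; j < nc; ++j)
      if (j != best) out(i, j) = 0;
  }
  return out;
}')
```

The compiled loop avoids the per-row subsetting and copying that the R-level loop pays for.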

Vandenman
3

Here is one option that seems a bit faster than proposal() on 600k rows:

foo <- function(Lambda) {
  nr <- nrow(Lambda)
  keep <- c(seq_len(nr), apply(Lambda[, -1], 1, which.max)*nr + seq_len(nr))
  replace(Lambda, -keep, 0L)
}
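The keep vector works because R stores matrices column-major: element (i, j) of an n-row matrix has linear index (j - 1) * n + i, so which.max(...) * nr + seq_len(nr) points straight at the winning entries of columns 2..p, and the negative index in replace() zeroes everything else. A tiny demo of that indexing:

```r
M <- matrix(1:6, nrow = 2)
# element (1, 3) of a 2-row matrix has linear index (3 - 1) * 2 + 1 = 5
M[(3 - 1) * 2 + 1] == M[1, 3]  # TRUE
```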

Edit

A vast improvement is replacing the apply() + which.max() combo with max.col(), as suggested by markus:

foo2 <- function(Lambda) {
  nr <- nrow(Lambda)
  keep <- c(seq_len(nr), max.col(Lambda[, -1], ties.method = "first")*nr + seq_len(nr))
  replace(Lambda, -keep, 0L)  
}
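A quick sanity check on the question's example (foo2 repeated here so the snippet runs on its own):

```r
foo2 <- function(Lambda) {  # as defined above
  nr <- nrow(Lambda)
  keep <- c(seq_len(nr), max.col(Lambda[, -1], ties.method = "first") * nr + seq_len(nr))
  replace(Lambda, -keep, 0L)
}

set.seed(60)
Lambda <- matrix(sample.int(30), 5)
foo2(Lambda)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]   19    0   27    0    0    0
# [2,]   16   28    0    0    0    0
# [3,]    2    0    0   23    0    0
# [4,]    5    0    0    0    0   29
# [5,]   26   30    0    0    0    0
```

This reproduces the output shown in the question.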

(Updated) Benchmark:

set.seed(60)
Lambda = matrix(sample.int(36e5), ncol = 6)
bench::mark(
  foo(Lambda),
  proposal(Lambda),
  foo2(Lambda),
  relative = TRUE
)[1:5]

  expression         min median `itr/sec` mem_alloc
  <bch:expr>       <dbl>  <dbl>     <dbl>     <dbl>
1 foo(Lambda)       17.7   12.1      1.09      3.67
2 proposal(Lambda)  19.3   13.1      1         1   
3 foo2(Lambda)       1      1       13.5       3.75
s_baldur
  • Your already good approach can be further improved if you use `max.col(Lambda[,-1], ties.method = "first")` instead of `apply(Lambda[, -1], 1, which.max)` – markus Oct 16 '20 at 21:08
  • @markus wow the difference is night and day. Thank you, I had forgotten about `max.col()`. – s_baldur Oct 17 '20 at 15:23
  • Thank you very much both sindri_baldur and @markus. – iago Oct 19 '20 at 09:23
1

A vectorized version that I can think of:

vectorized <- function(Lambda) {
  max <- matrixStats::rowMaxs(Lambda, cols = -1)
  noreplace <- sweep(Lambda, 1, max, "==")
  noreplace[, 1] <- TRUE
  Lambda * noreplace
}

But it is not faster than the for loop by @Vandenman.
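The trick here is that sweep(Lambda, 1, max, "==") builds a logical mask that is TRUE wherever a cell equals its row's maximum over columns 2..p; forcing the first column to TRUE and multiplying then zeroes everything else (note that with ties, every tied maximum would survive). On the question's example (the function repeated so the snippet is self-contained; assumes matrixStats is installed):

```r
vectorized <- function(Lambda) {  # as defined above
  max <- matrixStats::rowMaxs(Lambda, cols = -1)  # per-row max, first column excluded
  noreplace <- sweep(Lambda, 1, max, "==")        # TRUE where a cell equals its row max
  noreplace[, 1] <- TRUE                          # always keep the first column
  Lambda * noreplace
}

set.seed(60)
Lambda <- matrix(sample.int(30), 5)
vectorized(Lambda)  # reproduces the matrix shown in the question
```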

F. Privé
  • Thank you very much! It works great. Actually, benchmarks at the size of the one by sindri_baldur or bigger show that your function is almost as fast as `foo2` by him and markus – iago Oct 19 '20 at 09:26