
I'm looking to optimize (readability, speed) the following code, which simulates a draft based on assigned probabilities and then counts the occurrences of each unique value in each row of a matrix:

library(tidyverse)  # for %>%, map, as_tibble

# toy example values
random_probs <- rbeta(26, 10, 20)
random_probs <- random_probs/sum(random_probs)
player_names <- letters

draft_simulations <- 
  replicate(1000, sample(1:26, 26, replace = FALSE, prob = random_probs)) %>%
  matrix(., nrow = 26, ncol = 1000)

list_sims <- map(1:26, ~rowSums(draft_simulations == .))

as_tibble(do.call(rbind, list_sims)) %>%
  mutate(Skater = player_names) %>%
  select(Skater, everything())

1 Answer


Overview

I think you have two main options for the logic behind the problem:

  1. For each option o in 1:26, build a logical index matrix draft_simulations == o, then apply rowSums to it;
  2. For each player p in 1:26, count the values in its row with table(draft_simulations[p,]).
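As a tiny illustration of the two approaches (with made-up values), both recover the same kind of counts:

```r
# Made-up 3x4 matrix for illustration: rows are players, columns are simulations
m <- matrix(c(1, 2, 3,
              2, 1, 3,
              1, 3, 2,
              1, 2, 3), nrow = 3)

# Approach 1: logical index matrix for value 1, then rowSums
rowSums(m == 1)              # 3 1 0 -- times each row drew value 1

# Approach 2: count one row; nbins keeps the result a fixed length
tabulate(m[1, ], nbins = 3)  # 3 1 0 -- counts of values 1:3 in row 1
```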

On top of that choice, I'd say you have two main possible efficiency gains:

  1. Use better functions;
  2. Use a better looping structure.

Better functions

For (1.), I think rowSums and == are about as good as it gets. But for table, there are a lot of alternatives, as you can see in Is there an efficient alternative to table()?. I'd recommend reading all the answers to that question, as each has its pros and cons, but here I'll stick to base::tabulate.
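As a quick sketch of why tabulate is cheap but needs care: it bins positive integers directly, and without nbins the result is only as long as the largest value actually observed:

```r
x <- c(1, 2, 2)
tabulate(x)             # 1 2 -- only two bins, because 3+ never appears
tabulate(x, nbins = 5)  # 1 2 0 0 0 -- fixed length, safe to assign into a matrix row
```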

Better looping structure

You have for loops, the base::apply family, and purrr::map. It's not obvious which to use, so I'll test them below.

Testing

I tested options using rowSums, table, and tabulate, with all the looping options (code at the end), which yielded:

bench::mark(rowsums_for(), rowsums_sapply(), rowsums_map(),
            table_for(), table_apply(),
            tabulate_for(), tabulate_apply(),
            check = FALSE, min_time = 1) %>%
  select(expression:`gc/sec`)

  expression            min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 rowsums_for()      3.08ms      4ms      260.    2.59MB     9.04
2 rowsums_sapply()   3.14ms   4.13ms      251.     2.6MB     9.02
3 rowsums_map()      3.27ms    4.3ms      241.    2.59MB     9.01
4 table_for()        4.41ms   4.73ms      208.    1.87MB     5.33
5 table_apply()      4.59ms   4.96ms      200.    1.98MB     5.33
6 tabulate_for()    231.2µs  250.7µs     3788.  210.94KB    11.9 
7 tabulate_apply()  289.2µs    311µs     3090.   312.8KB    11.9 

So, at least at your problem's dimensions, for loops are better. Also, rowSums is better than table, but tabulate is the clear winner. This isn't necessarily true at other dimensions.

Functions:

library(magrittr)  # for %>%

plays <- 26
players <- 26
n_sims <- 1000

random_probs <- rbeta(plays, 10, 20)
random_probs <- random_probs/sum(random_probs)

draft_simulations <-
  replicate(n_sims, sample(1:plays, players, replace = FALSE, prob = random_probs)) %>%
  matrix(nrow = players, ncol = n_sims)


rowsums_for <- function(){
  result <- matrix(nrow = players, ncol = plays)
  for(i in 1:plays){
    result[,i] <- rowSums(draft_simulations == i)
  }
  result
}

rowsums_sapply <- function(){
  sapply(1:players, function(x) rowSums(draft_simulations == x))
}

rowsums_map <- function(){
  map(1:players, ~ rowSums(draft_simulations == .x))
}

table_for <- function(){
  result <- matrix(nrow = players, ncol = plays)
  for(i in 1:players){
    result[i,] <- table(draft_simulations[i,])
  }
  result
}

table_apply <- function(){
  apply(draft_simulations, 1, table)
}

tabulate_for <- function(){
  result <- matrix(nrow = players, ncol = plays)
  for(i in 1:players){
    # nbins keeps every count vector length `plays`, even if a value never appears in the row
    result[i,] <- tabulate(draft_simulations[i,], nbins = plays)
  }
  result
}

tabulate_apply <- function(){
  apply(draft_simulations, 1, tabulate, nbins = plays)
}