
I want to run a sim that randomly picks rows and adds up the total value of the rows based on a set of rules. I'm new to simulations, so I don't know where to start.

Rules: 9 total rows picked per sim. Each sim of 9 must include the following number of "positions":

QB: 1
RB: 2
WR: 3
TE: 1
K: 1
DST: 1

I want each sim to add up the value of the group (the WAR column), and the output to show the percentage of times each player appeared in, say, the top 10 percent of groups with the highest total WAR. Hopefully this makes some sense. The ultimate goal is to identify which players were most likely to be successful.

Here is a dput of the top ten players from each position as an example.


    structure(list(player = c("Justin Tucker", "Harrison Butker", 
    "Wil Lutz", "Greg Zuerlein", "Matt Gay", "Brandon McManus", "Jake Elliott", 
    "Robbie Gould", "Stephen Hauschka", "Dan Bailey", "Patrick Mahomes", 
    "Lamar Jackson", "Dak Prescott", "Russell Wilson", "Kyler Murray", 
    "Deshaun Watson", "Matt Ryan", "Josh Allen", "Tom Brady", "Carson Wentz", 
    "Christian McCaffrey", "Saquon Barkley", "Ezekiel Elliott", "Alvin Kamara", 
    "Dalvin Cook", "Clyde Edwards-Helaire", "Derrick Henry", "Miles Sanders", 
    "Joe Mixon", "Josh Jacobs", "Travis Kelce", "George Kittle", 
    "Mark Andrews", "Zach Ertz", "Darren Waller", "Evan Engram", 
    "Hayden Hurst", "Tyler Higbee", "Hunter Henry", "Mike Gesicki", 
    "Michael Thomas", "Davante Adams", "Julio Jones", "Tyreek Hill", 
    "DeAndre Hopkins", "Chris Godwin", "Kenny Golladay", "Allen Robinson", 
    "DJ Moore", "Odell Beckham"), adp = c(3, 3, 2, 2, 1, 1, 1, 1, 
    1, 1, 26, 23, 12, 11, 10, 9, 5, 4, 4, 4, 66, 57, 53, 50, 45, 
    43, 41, 40, 40, 39, 29, 26, 18, 15, 10, 8, 7, 6, 4, 4, 48, 40, 
    38, 37, 36, 34, 29, 27, 27, 27), WAR = c(0.27, 0.27, 0.1, 0.23, 
    0.09, 0.19, -0.83, -0.3, -0.1, -0.62, 2.26, 1.41, 0.91, 1.7, 
    2.28, 1.74, 0.28, 2.29, 1.12, 0.06, 1.02, -0.05, 1.36, 3.57, 
    3.48, 1.04, 2.91, 1.13, 0.69, 1.49, 2.79, 0.71, 0.85, -0.22, 
    1.67, 0.07, 0.26, 0.06, 0.35, 0.64, -0.04, 2.74, 0.63, 2.35, 
    1.49, 0.49, 0.33, 1.17, 0.61, 0.28), position = c("K", "K", "K", 
    "K", "K", "K", "K", "K", "K", "K", "QB", "QB", "QB", "QB", "QB", 
    "QB", "QB", "QB", "QB", "QB", "RB", "RB", "RB", "RB", "RB", "RB", 
    "RB", "RB", "RB", "RB", "TE", "TE", "TE", "TE", "TE", "TE", "TE", 
    "TE", "TE", "TE", "WR", "WR", "WR", "WR", "WR", "WR", "WR", "WR", 
    "WR", "WR")), row.names = c(NA, -50L), groups = structure(list(
    position = c("K", "QB", "RB", "TE", "WR"), .rows = structure(list(
        1:10, 11:20, 21:30, 31:40, 41:50), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -5L), class = c("tbl_df", 
    "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
    "tbl_df", "tbl", "data.frame"))

1 Answer


One idea is to use a lookup table to set the number of samples per group, then create a function that runs a "simulation" by sampling `n_samples` rows from each group. I'm not exactly sure what you are after with the sum of WAR, but once you have the simulations, manipulations like grouped sums should be straightforward (a rough sketch of that is included after the example output below).

Note there are no "DST" positions in your sample data, so each simulation only comes out with 8 rows instead of 9.

library(tidyverse)

# lookup table: how many players to draw from each position
df_sample <- data.frame(position  = c("K", "QB", "RB", "TE", "WR", "DST"),
                        n_samples = c(1, 1, 2, 1, 3, 1))

# df is the grouped tibble from your dput; join the sample sizes
# and nest each position's players into a list column
df_nest <- df %>%
  left_join(df_sample, by = "position") %>%
  group_by(position, n_samples) %>%
  nest()

# one "simulation": sample n_samples rows from each position, then flatten
run_sim <- function(nested_df = df_nest){
  nested_df %>%
    mutate(sim = map2(data, n_samples, sample_n)) %>%
    ungroup() %>%
    select(-data, -n_samples) %>%
    unnest(sim)
}

# run 10 simulations, labelled by sim id
map_dfr(1:10, ~run_sim(df_nest), .id = 'sim')

#----
# A tibble: 80 x 5
   sim   position player             adp   WAR
   <chr> <chr>    <chr>            <dbl> <dbl>
 1 1     K        Dan Bailey           1 -0.62
 2 1     QB       Patrick Mahomes     26  2.26
 3 1     RB       Miles Sanders       40  1.13
 4 1     RB       Joe Mixon           40  0.69
 5 1     TE       Evan Engram          8  0.07
 6 1     WR       Julio Jones         38  0.63
 7 1     WR       Michael Thomas      48 -0.04
 8 1     WR       DeAndre Hopkins     36  1.49
 9 2     K        Stephen Hauschka     1 -0.1 
10 2     QB       Russell Wilson      11  1.7 
# ... with 70 more rows
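
To show what the grouped-sum step could look like, here is a minimal sketch of the "top 10% of groups" summary from the question. The 1,000-sim count, the object names (sims, sim_totals, top_sims), and the 10% cutoff are just assumptions about what you're after:

# run a larger batch of simulations
sims <- map_dfr(1:1000, ~run_sim(df_nest), .id = 'sim')

# total WAR per simulated lineup
sim_totals <- sims %>%
  group_by(sim) %>%
  summarise(total_WAR = sum(WAR), .groups = 'drop')

# keep the top 10% of lineups by total WAR
top_sims <- sim_totals %>%
  slice_max(total_WAR, prop = 0.1)

# percentage of those top lineups that include each player
sims %>%
  semi_join(top_sims, by = 'sim') %>%
  count(position, player) %>%
  mutate(pct_of_top_lineups = 100 * n / nrow(top_sims)) %>%
  arrange(desc(pct_of_top_lineups))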
  • This is marvelous! And yes, sorry, I had just noticed DST was not actually in that data set I pulled. This is perfect! Would you recommend I do something like doParallel if I want to do a ton of sims? Could you help me with how to code for that? – Jeff Henderson Jul 23 '21 at 20:19
  • Check out the [furrr](https://github.com/DavisVaughan/furrr) package. It's basically a 1:1 replacement for the `purrr` functions in the code above (a quick sketch is included after these comments). – nniloc Jul 23 '21 at 20:21
  • Again, marvelous. Went from a 17-second sim of 1000 down to 7 seconds using 4 cores (or threads? I forget how that works). You have been precisely helpful! – Jeff Henderson Jul 23 '21 at 20:48
  • I have 6 cores and 12 logical processors. How many safe "workers" should I be able to use? @nniloc – Jeff Henderson Jul 23 '21 at 20:50
  • This is getting outside of my wheelhouse. Some interesting discussion here: https://stackoverflow.com/q/28954991/12400385 – nniloc Jul 23 '21 at 21:21
  • Thank you. I have one additional add-on I'd like: let's say I only want sims in which the cumulative value of the adp column is between 195 and 200 (in the full data set there is a much bigger spread than with the dput). Can this be worked in somehow? I can filter after the sim, but I figured building it in would let me get more sims faster. – Jeff Henderson Jul 24 '21 at 04:23
  • One idea is you could put an `if/else` statement in the function. Sample the players, then check if the sum is within your range. If it is, return the data frame; if not, return `NULL`. – nniloc Jul 24 '21 at 22:26
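
Expanding on that last comment, a minimal sketch of the adp-range version might look like this (the 195-200 bounds are the ones mentioned above, run_sim_filtered is just a hypothetical name, and map_dfr silently drops the NULL results, so a batch of N attempts can return fewer than N sims):

# sample a lineup, keep it only if the total adp falls in the target range
run_sim_filtered <- function(nested_df = df_nest, adp_min = 195, adp_max = 200){
  out <- nested_df %>%
    mutate(sim = map2(data, n_samples, sample_n)) %>%
    ungroup() %>%
    select(-data, -n_samples) %>%
    unnest(sim)

  if (between(sum(out$adp), adp_min, adp_max)) out else NULL
}

# only lineups inside the adp range come back
map_dfr(1:1000, ~run_sim_filtered(df_nest), .id = 'sim')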
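
Similarly, for the furrr suggestion earlier in the comments, a rough sketch of the parallel version (the worker count of 4 is only an example; future_map_dfr is the drop-in replacement for map_dfr):

library(furrr)

# start 4 parallel workers; adjust to your machine
plan(multisession, workers = 4)

# parallel drop-in for map_dfr; seed = TRUE requests parallel-safe random streams
future_map_dfr(1:1000, ~run_sim(df_nest), .id = 'sim',
               .options = furrr_options(seed = TRUE))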