0

I am a beginner with R and I'm having a hard time finding any info related to the task at hand. Basically, I am trying to calculate 5- year averages for 15 categories of NHL player statistics and add each of the results as a column in the '21-22 Player Data. So, for example, I'd want 5-year averages for a player (ex. Connor McDavid) to be displayed in the same dataset as the 21-22 player data, but each of the numbers needed to calculate the mean lives in its own spreadsheet that has been imported into R. I have an .xlsx worksheet for each year from 17-18 to 21-22 so 5 sheets in total. I have loaded each of the sheets in to Rstudio, but the next steps are very difficult for me to figure out.

what i have now

desired output

Rstudio uploaded datasets

I think I have to use a pipe, locate one specific cell (ex. Connor McDavid, goals) in 5 different data frames, use a mean function to find the average for this one particular cell (ex. Connor McDavid, goals), assign that as a vector 5_year_average_goals, then add that vector as a column in the original 21-22 dataset so I can compare the production for each player last season to their 5-year averages. Then repeat that step for each column (assists, points, etc.) Would I have to repeat these steps for each player (row)? Is there an easy way to use a placeholder that will calculate these averages for every player in the 21-22 dataset?

r2evans
  • 141,215
  • 6
  • 77
  • 149
Ken Hunt
  • 1
  • 1
  • 5
    It looks like you're new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. Taking pictures of data is unlikely to get you an answer. Includes sample data using the output from `dput(head(dataObject))`. If it's an import issue, than a link to a spreadsheet. You need to include any code that you've tried to use, as well. Check it out: [making R reproducible questions](https://stackoverflow.com/q/5963269). – Kat Jul 23 '22 at 16:50

1 Answers1

0

This is a guess, and just a start ...

I suggest as you're learning that using dplyr and friends might make some of this simpler.

library(dplyr)
files <- list.files(pattern="xlsx$")
lapply(setNames(nm = files), read_xlsx) %>%
  bind_rows(.id = "filename") %>%
  arrange(filename) %>%
  group_by(`Player Name`) %>%
  mutate(across(c(GPG, APG), ~ cumsum(.) / seq_along(.), .names = "{.col}_5yr_avg"))

Brief walk-through:

  • list.files(.) should produce a character vector of all of the filenames. If you need it to be a subset, filter it as needed. If they are in a different directory, then files <- list.files("path/to/dir", pattern="xlsx$", full.names=TRUE).
  • lapply(..) reads the xlsx file for each of the filenames found by list.files(.), returns a list of data.frames.
  • bind_rows(.) combines all of the nested frames into a single frame, and adds the names of the list as a new column, filename, which will contain the file from which each row was extracted.
  • arrange(.) sorts by the year, which should put things in chronological order, required for a running average. I'm assuming that the files sort correctly, you may need to adjust this if I got it wrong.
  • group_by(..) makes sure that the next expressions only see on player at a time (across all files).
  • mutate calculates (I believe) the running average over the years. It's not perfectly resilient to issues (e.g., gaps in years), but it's a good start.

Hope this helps.

r2evans
  • 141,215
  • 6
  • 77
  • 149