
I have current-season data on NBA players, with over 200 predictors (stats) for each player, covering this season up to this point only (e.g., average points per game this season). I'd like to run a stepwise regression model by team or by player, since the predictor set with the most explanatory value will vary by team or player. Ultimately, I want to predict future player performance, so I need to be able to access the model components (i.e., the coefficients, R-squared, etc.).

What would be the best way to automate stepwise regression by group (be it player or team)? The variables (outcome and predictors) are all continuous, so these are essentially just lm models, but I am trying to avoid writing out the model for every team if there is a way to loop through teams, use group by, or apply some similar function.

To do a stepwise regression for each of 3 teams, I could manually do this for each team:

set.seed(1)  # make the simulated data reproducible
df <- data.frame(team = sample(1:3, 100, replace = TRUE),
                 x1 = rnorm(100, mean = 0, sd = 1),
                 x2 = rnorm(100, mean = 0, sd = 1),
                 y = rnorm(100, mean = 0, sd = 1))

model1_empty<-lm(y~1,data=subset(df,team==1))
model1_full<-lm(y~. - team,data=subset(df,team==1))
model1_step<-step(model1_empty, scope = list(lower = model1_empty, upper = model1_full), direction = "forward")

model2_empty<-lm(y~1,data=subset(df,team==2))
model2_full<-lm(y~. - team,data=subset(df,team==2))
model2_step<-step(model2_empty, scope = list(lower = model2_empty, upper = model2_full), direction = "forward")

model3_empty<-lm(y~1,data=subset(df,team==3))
model3_full<-lm(y~. - team,data=subset(df,team==3))
model3_step<-step(model3_empty, scope = list(lower = model3_empty, upper = model3_full), direction = "forward")


Is there a more automated way to do this for 30 teams, or for 200 players?
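For illustration, one possible way to avoid the copy-and-paste is to wrap the per-team fit in a function and apply it over `split(df, df$team)`. This is only a sketch using base R: the helper name `fit_step`, the seed, and `trace = 0` are assumptions of mine, and the column names `team`, `x1`, `x2`, and `y` come from the toy data above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(team = sample(1:3, 100, replace = TRUE),
                 x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))

# Hypothetical helper: forward stepwise selection on one group's rows.
fit_step <- function(d) {
  d <- d[, names(d) != "team", drop = FALSE]  # drop the grouping column
  empty <- lm(y ~ 1, data = d)                # intercept-only starting model
  full  <- lm(y ~ ., data = d)                # all remaining columns as predictors
  step(empty,
       scope = list(lower = formula(empty), upper = formula(full)),
       direction = "forward", trace = 0)      # trace = 0 silences the step log
}

# One fitted model per team, stored in a list named by team.
models <- lapply(split(df, df$team), fit_step)

# Pull out model components for every team at once.
coefs <- lapply(models, coef)
r2 <- sapply(models, function(m) summary(m)$r.squared)
```

Each element can then be used like any other lm object, e.g. `models[["1"]]` for team 1 or `predict(models[["1"]], newdata = ...)`. Grouping by player instead would only change the column passed to `split()` and dropped inside the helper.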

hrprof
  • Welcome to SO, WayneCrawford! Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including attempted code (please be explicit about non-base packages), sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically after `set.seed(1)`), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Feb 01 '21 at 03:40
  • If you're building a predictive model I'd strongly suggest using something like the Lasso with cross validation for variable selection instead of this - I think you'll get a bunch of overfit junk this way. – Gregor Thomas Feb 01 '21 at 03:58
  • @r2evans Thank you for the comment - I've updated my question. – hrprof Feb 01 '21 at 04:25
  • @GregorThomas thank you - I will look at doing Lasso with cross validation (I'm clearly new at ML). A question for you: if I'm interested in predicting a player's stats for the next game they will play, I was planning to use regression by player or team and look for those with the highest R-squared values. In general, I am less interested in which variables were most predictive for a given player and more interested in which players I have the best ability to predict (I presume the model will ultimately point out which players are most consistent, essentially). Would you still suggest Lasso? – hrprof Feb 01 '21 at 04:29
  • Yes - more or less. If there are time-series type features you are using, then you may need to account for that in your cross-validation, i.e., have the first 1/2 of the season in your training set and the second half in your validation set. (And maybe the last couple games in your test set.) – Gregor Thomas Feb 01 '21 at 04:34
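The time-ordered split suggested in the last comment could be sketched roughly as below. This is a base-R illustration, not the commenters' code: it swaps in a plain `lm` for the Lasso, and the `game` column (chronological order), the simulated effect of `x1` on `y`, and the helper name `holdout_rmse` are all assumptions made for the example.

```r
set.seed(1)  # assumed seed for reproducibility
# Toy data in the question's layout, plus a hypothetical `game` column
# giving the chronological order of each team's games.
df <- data.frame(team = rep(1:3, each = 40),
                 game = rep(1:40, times = 3),
                 x1 = rnorm(120), x2 = rnorm(120))
df$y <- 0.5 * df$x1 + rnorm(120)  # simulated outcome, for illustration only

# Hypothetical helper: train on the earlier half of a group's games,
# then score predictions on the later half.
holdout_rmse <- function(d) {
  d <- d[order(d$game), ]
  n_train <- floor(nrow(d) / 2)
  train <- d[seq_len(n_train), ]
  valid <- d[-seq_len(n_train), ]
  fit <- lm(y ~ x1 + x2, data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))  # out-of-sample RMSE
}

rmse_by_team <- sapply(split(df, df$team), holdout_rmse)
sort(rmse_by_team)  # lowest out-of-sample error first
```

Ranking groups by a held-out error like this, rather than by in-sample R-squared, is one way to approach the "who can I predict best" question without rewarding models that merely overfit.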

0 Answers