
I am using NBA shot data to build shot prediction models with different regression techniques. When I fit a logistic regression model, I get the following warning: Warning message: glm.fit: algorithm did not converge. The predictions also do not seem to work at all: they are identical to the original Y variable (make or miss). My code is below. I got the data from here: Shot Data.

nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)

nba_shots_clean <- data.frame("game_id" = nba_shots$GAME_ID,
                    "location" = nba_shots$LOCATION,
                    "shot_number" = nba_shots$SHOT_NUMBER,
                    "closest_defender" = nba_shots$CLOSEST_DEFENDER,
                    "defender_distance" = nba_shots$CLOSE_DEF_DIST,
                    "points" = nba_shots$PTS,
                    "player_name" = nba_shots$player_name,
                    "dribbles" = nba_shots$DRIBBLES,
                    "shot_clock" = nba_shots$SHOT_CLOCK,
                    "quarter" = nba_shots$PERIOD,
                    "touch_time" = nba_shots$TOUCH_TIME,
                    "game_result" = nba_shots$W,
                    "FGM" = nba_shots$FGM)

mean(nba_shots_clean$shot_clock) # NA
# This returned NA, which means there are NAs in this column that need
# to be cleaned up.
# If the shot clock was NA, I assume it was the end of a quarter and the
# shot clock was off.
# For now I'm just going to set all of these NAs to zero, so a zero means
# the end of a quarter.
# Checking the number of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # 5567 shots were taken when the shot clock was turned
# off at the end of a quarter
# Setting these NAs to zero (note: this replaces NAs in every column,
# not only shot_clock)
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# Checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked

# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance + 
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain, 
family="binomial", na.action = na.omit)

nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)

This gives the following output, which suggests the prediction did nothing, since the predicted classes exactly match the original values.

    FALSE  TRUE
  0 21428     0
  1     0 17977

I would really appreciate any guidance.

Chris95
  • Try reading this: https://stats.stackexchange.com/questions/5354/logistic-regression-model-does-not-converge – staove7 Apr 26 '17 at 20:23
  • Try to provide a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data. It's much harder to help you if we can't run the code. – MrFlick Apr 26 '17 at 20:25
  • @MrFlick, is providing the CSV file via a link not good enough? – Chris95 Apr 26 '17 at 21:34
  • No. Linking to data on other sites is fragile and potentially unsafe. If the problem is really just with the model fitting, then all the data manipulation code is just noise that distracts from the real problem. You are more likely to get help if you make it as easy as possible to help you. – MrFlick Apr 26 '17 at 21:45
  • Ok, I can see why that makes it a lot easier for people to help. I got an answer below, but for future questions I will make sure to provide minimal reproducible examples with sample input data. Thank you! – Chris95 Apr 27 '17 at 05:17

1 Answer


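The confusion matrix of your model (model prediction vs. nbaTest$FGM) shows that your model has 100% accuracy on the test set.
This is due to the points variable in your dataset, which is perfectly associated with the dependent variable: it is 0 for every miss and 2 or 3 for every make. Such perfect separation is also what keeps glm.fit from converging. The cross-tabulation shows it: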

table(nba_shots_clean$points, nba_shots_clean$FGM)
        0     1
  0 87278     0
  2     0 58692
  3     0 15133
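
The same behaviour can be reproduced on a small simulated data set. The sketch below does not use the NBA data: made, pts and dist are hypothetical stand-ins for FGM, points and an unrelated predictor. It collects any warnings glm raises, which under these assumptions should include at least one glm.fit warning, and then refits without the leaking variable.

```r
# Simulated data, not the NBA shot logs: `made` plays the role of FGM and
# `pts` the role of points (0 for a miss, 2 or 3 for a make).
set.seed(1)
n    <- 200
made <- rbinom(n, size = 1, prob = 0.45)
pts  <- ifelse(made == 1, sample(c(2, 3), n, replace = TRUE), 0)
dist <- runif(n, min = 0, max = 10)   # a predictor unrelated to the outcome
sim  <- data.frame(made, pts, dist)

# Fit with the leaking predictor, collecting warnings instead of printing them
fit_warnings <- character(0)
fit_sep <- withCallingHandlers(
  glm(made ~ pts + dist, data = sim, family = "binomial"),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(fit_warnings)

# The fitted classes reproduce the outcome exactly
print(table(sim$made, fitted(fit_sep) > 0.5))

# Without the leaking predictor the fit behaves normally
fit_ok <- glm(made ~ dist, data = sim, family = "binomial")
print(fit_ok$converged)
```

A quick cross-tabulation of each candidate predictor against the response, as done above for points, is a cheap way to spot this kind of leakage before fitting.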

Try removing points from your model:

# create a test and train set
set.seed(1234)
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)

# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance + 
dribbles + shot_clock + quarter + touch_time, data=nbaTrain, 
family="binomial", na.action = na.omit)
summary(nbaLogitModel)

There are no warning messages now, and the estimated model is:

Call:
glm(formula = FGM ~ location + shot_number + defender_distance + 
    dribbles + shot_clock + quarter + touch_time, family = "binomial", 
    data = nbaTrain, na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.8995  -1.1072  -0.9743   1.2284   1.6799  

Coefficients:
                   Estimate Std. Error z value       Pr(>|z|)    
(Intercept)       -0.427688   0.025446 -16.808        < 2e-16 ***
locationH          0.037920   0.012091   3.136        0.00171 ** 
shot_number        0.007972   0.001722   4.630 0.000003656291 ***
defender_distance -0.006990   0.002242  -3.117        0.00182 ** 
dribbles           0.010582   0.004859   2.178        0.02941 *  
shot_clock         0.032759   0.001083  30.244        < 2e-16 ***
quarter           -0.043100   0.007045  -6.118 0.000000000946 ***
touch_time        -0.038006   0.005700  -6.668 0.000000000026 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153850  on 111532  degrees of freedom
Residual deviance: 152529  on 111525  degrees of freedom
AIC: 152545

Number of Fisher Scoring iterations: 4
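
One caveat about the split, offered as a side note rather than a correction of the results: sample.split() expects a vector of labels as its first argument. When it is given the whole data frame, the logical vector it returns has one entry per column, and subset() then recycles it across the rows. That seems consistent with the 111,533 training rows implied by the degrees of freedom above, which is about 9/13 of the 161,103 shots rather than 75%. With caTools, passing the label column instead (for example sample.split(nba_shots_clean$FGM, SplitRatio=0.75)) is the documented usage. The sketch below shows the idea with base R's sample() on a simulated data frame; sim and its columns are made up for illustration.

```r
# Simulated stand-in for the shot data (hypothetical columns and values)
set.seed(1234)
n <- 1000
sim <- data.frame(
  FGM        = rbinom(n, size = 1, prob = 0.45),
  shot_clock = runif(n, min = 0, max = 24),
  dribbles   = rpois(n, lambda = 2)
)

# length() of a data frame is its number of columns, so a split vector
# built from the whole data frame has only ncol(sim) entries
print(length(sim))       # number of columns
print(length(sim$FGM))   # number of rows

# Row-wise 75/25 split with base R: draw 75% of the row indices
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
simTrain  <- sim[train_idx, ]
simTest   <- sim[-train_idx, ]

print(nrow(simTrain) / n)   # share of rows in the training set
```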

The confusion matrix is:

nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)

  FALSE  TRUE
0 21554  5335
1 16726  5955
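
To put this matrix in perspective, the accuracy can be compared with always predicting the majority class. The snippet below only re-enters the four counts printed above; cm, accuracy and baseline are names chosen for this illustration. It gives an accuracy of roughly 55.5% against a baseline of roughly 54.2%, so the remaining predictors carry only a little information about the outcome.

```r
# Confusion matrix reported above: rows = actual FGM, columns = predicted > 0.5
cm <- matrix(c(21554, 16726, 5335, 5955), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

accuracy <- sum(diag(cm)) / sum(cm)      # share of correct predictions
baseline <- max(rowSums(cm)) / sum(cm)   # always predicting the majority class

print(round(c(accuracy = accuracy, baseline = baseline), 3))
```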
Marco Sandri