
I have a training set with x columns representing the specific stadium a match is being played in. The columns are clearly linearly dependent in the training set, as each match must occur in exactly one of the stadiums.

However, the test data I pass in may include a stadium not seen in the training data. I would therefore like to include all x columns when training an R glm, such that the average of the stadium coefficients is zero. Then, when a new stadium is seen, it is essentially given the average of all the known stadiums' coefficients.

The issue is that R's glm function detects that I have linearly dependent columns in my training set and sets one of the coefficients to NA so that the rest are linearly independent. How do I:

Stop R from inserting the NA coefficient for one of my columns in the glm function, AND ensure all the stadium coefficients sum to 0?

Some example code

# Past observations
outcome   = c(1  ,0  ,0  ,1  ,0  ,1  ,0  ,0  ,1  ,0  ,1  )
skill     = c(0.1,0.5,0.6,0.3,0.1,0.3,0.9,0.6,0.5,0.1,0.4)
stadium_1 = c(1  ,1  ,0  ,0  ,0  ,0  ,0  ,0  ,0  ,0  ,0  )
stadium_2 = c(0  ,0  ,1  ,1  ,1  ,1  ,1  ,0  ,0  ,0  ,0  )
stadium_3 = c(0  ,0  ,0  ,0  ,0  ,0  ,0  ,1  ,1  ,1  ,1  )

train_glm_data = data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)
LR = glm(outcome ~ . - outcome, data = train_glm_data, family = binomial(link = 'logit'))
print(predict(LR, type = 'response'))

# New observations (for a new stadium we have not seen before)
skill     = c(0.1)
stadium_1 = c(0  )
stadium_2 = c(0  )
stadium_3 = c(0  )

test_glm_data = data.frame(skill, stadium_1, stadium_2, stadium_3)
print(predict(LR, test_glm_data, type = 'response'))

# Note that in this case, the prediction is simply the same as if the match had been played in stadium_3
# Instead I would like it to be an average of all the known stadiums' coefficients
# If they all sum to 0 this is essentially already done for me
# However if not, then the stadium_3 coefficient is buried somewhere in the intercept term
rwolst
    What is the `glm` model that you are fitting? It would help if you provide a minimal reproducible example. – Weihuang Wong Sep 25 '16 at 11:55
  • Possible duplicate of [linear regression "NA" estimate just for last coefficient](http://stackoverflow.com/questions/7337761/linear-regression-na-estimate-just-for-last-coefficient) – Sandipan Dey Sep 25 '16 at 18:52

2 Answers

# Collapse the three dummy columns into a single stadium factor
train_glm_data$stadium <- NA
train_glm_data$stadium[train_glm_data$stadium_1 == 1] <- "Stadium 1"
train_glm_data$stadium[train_glm_data$stadium_2 == 1] <- "Stadium 2"
train_glm_data$stadium[train_glm_data$stadium_3 == 1] <- "Stadium 3"
train_glm_data$stadium_1 <- NULL
train_glm_data$stadium_2 <- NULL
train_glm_data$stadium_3 <- NULL

# Add an unused "Stadium 4" level, then append an artificial row for it with
# the average outcome and skill, so the model estimates a coefficient for an
# otherwise-unseen stadium
train_glm_data$stadium         <- as.factor(train_glm_data$stadium)
levels(train_glm_data$stadium) <- c("Stadium 1", "Stadium 2", "Stadium 3", "Stadium 4")
train_glm_data                 <- rbind(train_glm_data,
                                        c(round(mean(train_glm_data$outcome)),
                                          mean(train_glm_data$skill),
                                          "Stadium 4"))

# rbind with a character vector coerces the numeric columns to character,
# so convert them back
train_glm_data$outcome <- as.numeric(train_glm_data$outcome)
train_glm_data$skill   <- as.numeric(train_glm_data$skill)

LR = glm(outcome ~ stadium + skill, data = train_glm_data, family = binomial(link = 'logit'))
print(predict(LR, type = 'response'))

# New observations (for a new stadium we have not seen before)
skill     = c(0.1)
stadium   = "Stadium 4"

test_glm_data = data.frame(skill, stadium)
print(predict(LR, test_glm_data, type = 'response'))

Regarding the question of how to include coefficients for all levels: don't do this. It's called the dummy variable trap. The model matrix becomes singular if a reference level is not excluded.

The only exception is if you estimate a no-intercept model. Read more about the dummy variable trap here.
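
One way to keep the intercept and still have stadium effects that sum to zero is sum-to-zero ("deviation") coding via R's built-in contr.sum. A minimal, self-contained sketch on the question's data (LR_sum, dat and stadium_effects are illustrative names):

# contr.sum keeps the intercept and constrains the stadium effects to sum to
# zero, so an unseen stadium can be scored with a stadium effect of 0, i.e.
# the average of the known effects.
outcome <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill   <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium <- factor(c(rep("Stadium 1", 2), rep("Stadium 2", 5), rep("Stadium 3", 4)))

dat    <- data.frame(outcome, skill, stadium)
LR_sum <- glm(outcome ~ skill + stadium, data = dat,
              family = binomial(link = 'logit'),
              contrasts = list(stadium = contr.sum))

# contr.sum reports k-1 coefficients ("stadium1", "stadium2"); the effect of
# the last level is minus their sum, so all k effects sum to zero.
b <- coef(LR_sum)
stadium_effects <- setNames(
  c(b["stadium1"], b["stadium2"], -sum(b[c("stadium1", "stadium2")])),
  levels(dat$stadium))
print(stadium_effects)

# Prediction for a new, unseen stadium: set the stadium effect to 0 (the
# average of all stadium effects) and apply the inverse logit by hand.
new_skill <- 0.1
print(unname(plogis(b["(Intercept)"] + b["skill"] * new_skill)))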

Hack-R

To estimate coefficients for all your dummy variables, you can add "-1" to your formula, which will remove the intercept:

LR = glm(outcome ~ . - outcome - 1, data = train_glm_data, family=binomial(link='logit'))

Coefficients:

coef(LR)
#      skill  stadium_1  stadium_2  stadium_3 
# -2.8080177  0.8424053  0.7541226  1.1313135 

For the unseen training levels problem, @hack-r has proposed some good ideas. Another idea is to impute 1/n (where n is the number of observed stadiums) for all the dummy variables for the new observation.
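
A minimal sketch of that imputation, assuming the no-intercept model LR fitted above and the question's column names:

# Unseen stadium: give each of the n = 3 observed stadium dummies a value of
# 1/3, so the stadium contribution to the linear predictor is the average of
# the three estimated stadium coefficients.
test_glm_data <- data.frame(skill     = 0.1,
                            stadium_1 = 1/3,
                            stadium_2 = 1/3,
                            stadium_3 = 1/3)
print(predict(LR, test_glm_data, type = 'response'))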

Weihuang Wong