I have a training set with x indicator columns, each representing a specific stadium in which a match is played. Together with the intercept, these columns are clearly linearly dependent in the training set, since every match must be played in exactly one of the stadiums.
However, the test data may include a stadium not seen in the training data. I would therefore like to include all x columns when fitting an R glm, constrained so that the stadium coefficients average to zero. A new stadium would then effectively be given the average of all the known stadiums' coefficients.
The problem is that R's glm function detects the linearly dependent columns in my training set and sets one of the coefficients to NA so that the remaining columns are linearly independent. How do I:
Stop R from inserting the NA coefficient for one of my columns in the glm function AND ensure that all the stadium coefficients sum to 0?
Some example code:
# Past observations
outcome = c(1 ,0 ,0 ,1 ,0 ,1 ,0 ,0 ,1 ,0 ,1 )
skill = c(0.1,0.5,0.6,0.3,0.1,0.3,0.9,0.6,0.5,0.1,0.4)
stadium_1 = c(1 ,1 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 )
stadium_2 = c(0 ,0 ,1 ,1 ,1 ,1 ,1 ,0 ,0 ,0 ,0 )
stadium_3 = c(0 ,0 ,0 ,0 ,0 ,0 ,0 ,1 ,1 ,1 ,1 )
train_glm_data = data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)
# Fit a logistic regression on skill and all three stadium indicators
LR = glm(outcome ~ ., data = train_glm_data, family = binomial(link = 'logit'))
print(predict(LR, type = 'response'))
# New observations (for a new stadium we have not seen before)
skill = c(0.1)
stadium_1 = c(0 )
stadium_2 = c(0 )
stadium_3 = c(0 )
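# The outcome is unknown for new data, so it is left out (including it would recycle the single row 11 times)
test_glm_data = data.frame(skill, stadium_1, stadium_2, stadium_3)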
print(predict(LR, test_glm_data, type = 'response'))
# Note that in this case the prediction is the same as if the match had been played at stadium_3,
# because stadium_3 is the column whose coefficient glm set to NA
# Instead I would like the prediction to use the average of all the known stadiums' coefficients
# If the coefficients all summed to 0, this would essentially already be done for me
# As it stands, the stadium_3 effect is absorbed into the intercept term
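
For reference, the sum-to-zero constraint described above can be sketched with R's built-in `contr.sum` contrasts. This is only a minimal sketch under some assumptions: the three indicator columns are recoded as a single factor (the names `stadium`, `train`, `fit`, `stadium_effects` and `p_new` are illustrative and not part of the code above), and the prediction for the unseen stadium is computed by hand, because `predict` stops with an error when it meets a factor level that was not in the training data.

```r
# Same toy data as above, with the three dummy columns recoded as one factor
outcome <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill   <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium <- factor(rep(c("s1", "s2", "s3"), times = c(2, 5, 4)))
train   <- data.frame(outcome, skill, stadium)

# Sum-to-zero contrasts: the stadium effects are constrained to sum to 0,
# so no coefficient is reported as NA
fit <- glm(outcome ~ skill + stadium, data = train,
           family = binomial(link = "logit"),
           contrasts = list(stadium = "contr.sum"))

b <- coef(fit)

# Only the first two effects are reported (as stadium1 and stadium2);
# the third is minus the sum of the others
stadium_effects <- c(b[c("stadium1", "stadium2")],
                     stadium3 = -sum(b[c("stadium1", "stadium2")]))
print(stadium_effects)

# Prediction for an unseen stadium with skill = 0.1: its stadium effect is
# taken to be 0, i.e. the unweighted average of the known stadium effects
p_new <- unname(plogis(b["(Intercept)"] + b["skill"] * 0.1))
print(p_new)
```

With this coding the intercept sits at the unweighted mean of the stadium effects on the logit scale, so leaving out the stadium term for a new stadium gives it the average effect. Note that this is an average on the logit scale, not an average of the predicted probabilities.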