I've spent days searching for models that fulfill all of the standard OLS assumptions (normally distributed residuals, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I have been trying to write a script that automates this process.
Here is the sample code for the calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers via Cook's distance: an observation with distance > 4/(n-k-1) is treated as an outlier (n = observations, k = predictors)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and collect the p-values from shapiro.test() and bptest(), or the VIF values, for every model created, so that I can compare the significance values or the multicollinearity across models. In my dataset multicollinearity shouldn't be a problem, and since vif() returns one value per variable it would probably be harder to implement in the code, so the p-values from shapiro.test() and bptest() would suffice.
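To make the idea concrete, here is a minimal sketch of such a loop on the toy data from above. It assumes x1 is always the response and every other column is a candidate predictor; the helper name `check_model` and the fixed seed are made up for illustration and are not part of any package.

```r
library(lmtest)  # provides bptest()

set.seed(1)  # assumption: fixed seed so the toy data are reproducible
df <- data.frame(x1 = runif(100, 0, 10), x2 = runif(100, 0, 10),
                 x3 = runif(100, 0, 10), x4 = runif(100, 0, 10),
                 x5 = runif(100, 0, 10))

response   <- "x1"
predictors <- setdiff(names(df), response)

# every non-empty subset of the predictors, as a list of character vectors
combos <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# hypothetical helper: fit one model and return only the two p-values
check_model <- function(vars, data) {
  fit <- lm(reformulate(vars, response = response), data = data)
  data.frame(
    model     = paste(vars, collapse = " + "),
    shapiro_p = shapiro.test(rstandard(fit))$p.value,
    bp_p      = unname(bptest(fit)$p.value),
    stringsAsFactors = FALSE
  )
}

# one row per model, sorted so the "most normal" residuals come first
results <- do.call(rbind, lapply(combos, check_model, data = df))
results <- results[order(-results$shapiro_p), ]
print(results)
```

With p candidate predictors this fits 2^p - 1 models, so 11 predictors would mean 2047 fits, which should still be manageable for lm().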
I've tried to write several scripts to automate the process, but without success (unfortunately I'm not a programmer). I know there are already some threads dealing with this problem:
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script that also extracts JUST the p-values.
The tests for models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
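For the outlier-free variant, the step I am after could look roughly like the sketch below: fit the model, drop rows whose Cook's distance reaches 4/(n-k-1), refit, and keep only the p-values. The helper name `check_no_outliers` is hypothetical, and the sketch assumes the data contain no missing values (so that the Cook's distances line up with the rows of the data frame).

```r
library(lmtest)  # provides bptest()

set.seed(1)  # assumption: fixed seed so the toy data are reproducible
df <- data.frame(x1 = runif(100, 0, 10), x2 = runif(100, 0, 10),
                 x3 = runif(100, 0, 10), x4 = runif(100, 0, 10),
                 x5 = runif(100, 0, 10))

# hypothetical helper: refit without high-influence rows, return p-values
check_no_outliers <- function(vars, response, data) {
  f   <- reformulate(vars, response = response)
  fit <- lm(f, data = data)

  n <- nobs(fit)      # number of observations used in the fit
  k <- length(vars)   # number of predictors
  keep <- cooks.distance(fit) < 4 / (n - k - 1)

  refit <- lm(f, data = data[keep, , drop = FALSE])
  data.frame(
    model     = paste(vars, collapse = " + "),
    n_removed = sum(!keep),
    shapiro_p = shapiro.test(rstandard(refit))$p.value,
    bp_p      = unname(bptest(refit)$p.value),
    stringsAsFactors = FALSE
  )
}

# the full model from the sample code, after outlier removal
res_full <- check_no_outliers(c("x2", "x3", "x4", "x5"), "x1", df)
print(res_full)
```

The threshold is computed from the fitted model instead of being hard-coded as 4/(100-4-1), so the same function can be applied to any variable combination.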
I would really appreciate any suggestions or help with this.