
I am trying to analyze all the variables in my data set to see which subset of variables best describes my dependent variable StockPrice. This is the code I use to do so:

 install.packages("olsrr")
 library(olsrr)

model <- lm(StockPrice ~ ESGscore + MarketValue + ibc + ni + CommonEquity + AssetsTotal + ROA + ROE + MarketToBook + TobinQ + Liabilities + stock_ret_yr_0 + stock_ret_yr_minus1 + stock_ret_yr_plus1 + EPS + BookValuePS, data = Datensatz_Excel)
ols_step_best_subset(model)
A <- ols_step_best_subset(model)
plot(A)

And here is some data to reproduce it:

Please let me know if this works for you; this is my first time providing data this way. If there is a better way to share data (e.g., clearly arranged) other than using dput(), please let me know! :)

structure(list(Company = c("AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC"), Year = c(2011, 2012, 2013, 2014, 
2015, 2016, 2017), gvkey = c(1209, 1209, 1209, 1209, 1209, 1209, 
1209), ggroup = c(1510, 1510, 1510, 1510, 1510, 1510, 1510), 
    ESGscore = c(84.2750015258789, 81.9225006103516, 77.4024963378906, 
    80.1125030517578, 78.6449966430664, 76.3775024414062, 79.2699966430664
    ), MarketValue = c(17934.369140625, 17537.578125, 23639.79296875, 
    30868.392578125, 28037.404296875, 31271.359375, 35903.4921875
    ), ibc = c(1252.59997558594, 1025.19995117188, 1042.5, 988.5, 
    1317.59997558594, 1545.69995117188, 1155.19995117188), ni = c(1224.19995117188, 
    1167.30004882812, 994.200012207031, 991.700012207031, 1277.90002441406, 
    631.099975585938, 3000.39990234375), CommonEquity = c(5795.7998046875, 
    6477.2001953125, 7042.10009765625, 7365.7998046875, 7249, 
    7079.60009765625, 10086.2001953125), AssetsTotal = c(14290.7001953125, 
    16941.80078125, 17850.099609375, 17779.099609375, 17438.099609375, 
    18055.30078125, 18467.19921875), ROA = c(0.0906418636441231, 
    0.0816824957728386, 0.0586832538247108, 0.0555571131408215, 
    0.0718765333294868, 0.0361908674240112, 0.166178345680237
    ), ROE = c(0.220699846744537, 0.201404482126236, 0.153492242097855, 
    0.140824466943741, 0.17349100112915, 0.0870602801442146, 
    0.423809230327606), MarketToBook = c(3.09437346458435, 2.70758628845215, 
    3.35692381858826, 4.19077253341675, 3.86776161193848, 4.41710805892944, 
    3.55966472625732), TobinQ = c(1.84940338134766, 1.65284550189972, 
    1.92983758449554, 2.32192254066467, 2.19212555885315, 2.33987021446228, 
    2.39800786972046), Liabilities = c(8494.900390625, 10464.6005859375, 
    10807.9995117188, 10413.2998046875, 10189.099609375, 10975.7006835938, 
    8380.9990234375), StockPrice = c(85.19, 84.02, 111.78, 144.23, 
    130.11, 143.82, 164.08), stock_ret_yr_0 = c(-0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351, 0.170138001441956), 
    stock_ret_yr_minus1 = c(0.150884702801704, -0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351), stock_ret_yr_plus1 = c(0.0164456591010094, 
    0.369286864995956, 0.321167588233948, -0.076192781329155, 
    0.252576589584351, 0.170138001441956, 0.00247942004352808
    ), EPS = c(5.75, 5.53, 4.74, 4.66, 5.95, 2.92, 13.76), BookValuePS = c(27.21, 
    30.67, 33.58, 34.63, 33.73, 32.72, 46.27)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

The problem is that whenever R has to analyze all 16 variables, the program just does not work. R shows the code in the console (the lower box) and "model" appears in the Environment pane (the data box), but nothing happens after that. There is no error message or anything like that. I've also tried waiting for 15 minutes, but nothing happened.

If I analyze just 4-5 variables, there is no problem at all.

Has anyone out there run into the same problem and maybe found a solution? :)

Happy New Year everybody and thanks for the help :)

  • Can you provide the data source or reproduce it? The best way to get great answers quickly is to make your question [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Kat Jan 03 '22 at 04:57
  • On a separate note, you may find that using a random forest model is better for particularly wide data. Set importance to true and that feature tells you which variables have the most influence on the outcome. You can use it to find the best combinations, as well. – Kat Jan 03 '22 at 04:59
  • Hi @Kat I tried to provide some data. If you need more companies than just one please let me know! :) – wrangjangler Jan 04 '22 at 13:33

1 Answer


BLUF, or the trendier TL;DR: I think the function ols_step_best_subset() has limitations, because best-subset selection is an exhaustive search and the work explodes with the number of predictors. However, there are other ways to get what you are looking for.
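Here is a quick back-of-the-envelope check (plain base R, just to show the scale of the problem): an exhaustive best-subset search fits one candidate model for every non-empty subset of the predictors, so the count doubles with each predictor you add.

p <- c(5, 10, 16)
data.frame(predictors = p, candidate_models = 2^p - 1)
#   predictors candidate_models
# 1          5               31
# 2         10             1023
# 3         16            65535

With 4-5 predictors that is a few dozen fits; with 16 it is 65,535, which is why the call appears to hang rather than error.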

The long version:

Okay, I worked with the data you provided, but I could not reproduce the issue. I thought that might have been due to how few rows you provided. (You provided a lot of information! There just isn't much for the model to work with.)

Instead of asking you for more data, since this seemed to be more a question about the limits of R, I found a built-in wide dataset. I still couldn't reproduce the hang.

I used the data dhfr from the package caret. It has over 200 variables; I randomly chose 24.

I didn't clean it. I did look at overly influential variables, and that is likely going to be an issue; I did this as a way to screen for multicollinearity, which is a really big problem for linear regression. Since the check ran without erroring, I used this data as-is.

library(tidyverse)
library(olsrr)
library(caret) 
library(randomForest)
library(car)

#------------------- Collect and Clean -------------------
data(dhfr, package = "caret")

# arbitrarily chose columns
dhfr2 <- dhfr[, 2:25]

# just checking what's there to work with
summary(dhfr2)

# check for multicollinearity / overly influential predictors
vif(lm(moeGao_Abra_L ~ ., data = dhfr2))
# an error here ("aliased coefficients") means perfect multicollinearity
#   (multiple variables with the same information)
# high VIF values are likely a source of a big problem in lm();
#   there are several with very high values here
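
# if vif() errors with "there are aliased coefficients in the model",
# alias() from base R reports which predictors are exact linear
# combinations of others -- drop one from each aliased set and refit
alias(lm(moeGao_Abra_L ~ ., data = dhfr2))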

# data isn't clean or explored for actual analysis, 
# but good enough to answer your inquiry

#----------------- Prepare for Modeling ------------------
set.seed(3926)
# partition for training and testing
tr <- createDataPartition(dhfr2[, 1], p = .7, list = F)

#---------------- Linear Regression Model ----------------
# use all remaining variables in the dataset (23 predictors)
fit.lm <- lm(moeGao_Abra_L~., data = dhfr2[tr, ])
summary(fit.lm)

The model had an explained variance (R2) of .9629.

p <- predict(fit.lm, dhfr2[-tr, -1])
postResample(p, dhfr2[-tr, 1])
#      RMSE  Rsquared       MAE 
# 0.2416366 0.9416441 0.1744155  
# potentially an issue with overfitting;
#   assumptions not assessed
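
As a quick follow-up on the "assumptions not assessed" caveat (base R, not part of the workflow above), the plot() method for lm objects produces the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(fit.lm)   # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))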

If you want to assess a model this large without using ols_step_best_subset(), you could use rfe() (recursive feature elimination) from caret.

You would have to create the lm model through caret.

set.seed(3926)
fit.lmT <- train(moeGao_Abra_L~., data = dhfr2[tr, ],
                 method = "lm")

You have to set up a control object first, but this really just depends on what type of model you are using. For lm, you just need lmFuncs, the bundle of helper functions caret provides for linear models.

This is going to use repeated cross-validation.

ctrl = rfeControl(lmFuncs, 
                  method = "repeatedcv",
                  repeats = 5,
                  verbose = F)

Then you can apply rfe().

lmP <- rfe(x = dhfr2[tr, -1],
           y = dhfr2[tr, 1],
           sizes = 4:10,  
           rfeControl = ctrl)

In the call to rfe(), the sizes argument is important. From the help for this function: "a numeric vector of integers corresponding to the number of features that should be retained." This looked for the best-performing group of four, group of five, and so on up to a group of ten. You have a lot more you can control here, as well.
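
Once rfe() is done, a couple of standard caret accessors are handy for pulling out the result (a small sketch using the lmP object fitted above):

lmP                             # resampling summary, printed below
predictors(lmP)                 # names of the variables in the selected subset
plot(lmP, type = c("g", "o"))   # performance profile across subset sizes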

You can read all the details in the caret documentation (https://topepo.github.io/caret/).

The results of rfe():

# Recursive feature selection
# 
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
# 
# Resampling performance over subset size:
# 
#  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
#          4 0.2814   0.9062 0.2243 0.05158    0.04136 0.04234         
#          5 0.2788   0.9079 0.2192 0.04495    0.03832 0.03776         
#          6 0.2743   0.9106 0.2149 0.04504    0.03794 0.03795         
#          7 0.2670   0.9137 0.2093 0.04296    0.03875 0.03498         
#          8 0.2542   0.9205 0.2008 0.04442    0.03920 0.03437         
#          9 0.2335   0.9331 0.1851 0.04086    0.03124 0.03231         
#         10 0.2203   0.9402 0.1754 0.03177    0.02793 0.02591         
#         23 0.2034   0.9489 0.1618 0.03034    0.02410 0.02428        *
# 
# The top 5 variables (out of 23):
#    moeGao_Abra_R, moe2D_GCUT_PEOE_3, moeGao_Abra_acidity, 
#    moe2D_BCUT_SMR_0, moe2D_BCUT_SMR_3
Kat
  • Wooow! Thank you so much @Kat! I have another two questions, if you wouldn't mind: I tried to check for multicollinearity with my data and I get the error message. Is there something I can do now? And what is the difference between `fit.lm()` and `lm()`; is it better to use `fit.lm()` instead? – wrangjangler Jan 05 '22 at 13:58
  • 1
    If the error you received includes `there are aliased coefficients in the model`, that means that you have a problem with multicollinearity. If any two variables have a 'real' correlation of about .8 or higher, one of the two shouldn't be included. I say 'real' because correlation has assumptions, as well. Anscombe's quartet tells a great story about why Pearson's product-moment correlation coefficient has data assumptions: a linear relationship between the variables, normal distributions, and no outliers. If the variables aren't truly correlated but show a high coefficient anyway, use both variables. – Kat Jan 05 '22 at 14:19
  • 1
    I just realized I only answered one of your questions. `fit.lm` is the object name, not the function. I used `lm()` and `train()`. `train()` with `method = "lm"` is using the `caret` package and the base R function `lm()`. I find that `caret` is a fantastic resource. In the end, it's still using the same regression function, though. – Kat Jan 06 '22 at 02:11
  • THANK YOU so much @Kat !!! – wrangjangler Jan 06 '22 at 23:46