I am working on a tv retail dataset in R and wanted to put steps I will need to use repeatedly into a function.
This includes checking the VIF and return it, run the STEP algorithm to determine the best model and then use the result of the STEP and display it.
The major issue is the error message
Error in eval(predvars, data, env) : object 'Hour' not found
which appears to appear in the step() call.
Regression <- function(data, dep_var, features) {
lin.null = lm(paste(dep_var,'~ 1', sep = ''), data= data)
lin.full = lm(paste(dep_var,'~', paste(features, collapse='+'), sep = ''), data = data)
vif(lin.full)
opt = step(lin.null, scope = list(lower = lin.null, upper = lin.full), direction = "forward")
step_opt = opt$call
stargazer(step_opt, type = 'text')
}
dep_var = 'imp'
feat = c('Hour', 'grp')
paste(dep_var,'~', paste(feat, collapse='+'), sep = '')
Regression(comb_a, 'imp', feat)
The final result should show me the VIF values for each variable and the stargazer output of the STEP optimized regression.
EDIT 1:
comb_a is the input data the regression should take The dput() output follows down below:
# comb_a
structure(list(Day = structure(c(1483833600, 1483833600, 1483833600,
1483833600, 1483833600, 1483833600), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Hour = c(0, 1, 6, 7, 8, 9), Model = c("Model A",
"Model A", "Model A", "Model A", "Model A", "Model A"), tv_count = c(5L,
8L, 4L, 9L, 11L, 8L), grp_abs = c(55500, 8308, 19026, 12184,
10141, 113225), grp = c(0.22, 0.03, 0.07, 0.05, 0.04, 0.45),
sum_duration = c(150, 240, 120, 270, 330, 240), grp_per_second = c(370,
34.6166666666667, 158.55, 45.1259259259259, 30.730303030303,
471.770833333333), hours_since = c(NA, 1, 5, 1, 1, 1), camp_count = c(2L,
2L, 2L, 2L, 3L, 4L), imp = c(528, 319, 97, 182, 327, 785),
clicks = c(28, 15, 6, 13, 29, 53), leads = c(0, 0, 0, 0,
0, 1)), .Names = c("Day", "Hour", "Model", "tv_count", "grp_abs",
"grp", "sum_duration", "grp_per_second", "hours_since", "camp_count",
"imp", "clicks", "leads"), row.names = c(NA, -6L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = c("Day", "Hour"), drop = TRUE, indices = list(
0L, 1L, 2L, 3L, 4L, 5L), group_sizes = c(1L, 1L, 1L, 1L,
1L, 1L), biggest_group_size = 1L, labels = structure(list(Day = structure(c(1483833600,
1483833600, 1483833600, 1483833600, 1483833600, 1483833600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Hour = c(0, 1, 6, 7, 8, 9)), row.names = c(NA,
-6L), class = "data.frame", vars = c("Day", "Hour"), drop = TRUE, .Names = c("Day",
"Hour")))
desired output would be: (Numbers are just for representation)
> vif(lin.full)
Hour grp sum_duration grp_per_second hours_since camp_count
2.979362 4.981504 2.290328 3.279818 1.013725 1.110823
imp clicks
7.471457 9.244811
> stargazer(step_opt, type = 'text')
===============================================
Dependent variable:
---------------------------
leads
-----------------------------------------------
clicks 0.005***
(0.0004)
camp_count 0.040*
(0.024)
Constant -0.107
(0.098)
-----------------------------------------------
Observations 898
R2 0.181
Adjusted R2 0.179
Residual Std. Error 0.772 (df = 895)
F Statistic 98.901*** (df = 2; 895)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01