Is there a method for iterating data frame variables in a formula object?

Question

In my case, I'm hoping to fit different glm and lda models on a certain subset of variables. The Y variable (the output) is the same in each model, and a forward best subset selection is carried out over the variables found most significant in a random forest analysis.

However, when trying to iterate, I can't find anything that works along the following lines:

# Ordered data frame (ordered_df_train) is just the data frame ordered using the previously mentioned
# method, considering the first variable to be crim (the output)
list_formula <- vector(mode = "list", length = 13)
list_formula[[1]] <- ordered_df_train$crim ~ ordered_df_train$age
for(j in 3:14){
  list_formula[[j-1]] <- ordered_df_train$colnames(ordered_df_train)[j]
}

However,

ordered_df_train$colnames(ordered_df_train)[j]

returns NULL when executed, so the expected variable is not picked up.

Edit: As suggested, the previously used data for reproducibility is defined as:

library(MASS)
df_train <- Boston
ordered_df_train <- data.frame(
    crim = df_train$crim,
    age = df_train$age,
    nox = df_train$nox,
    tax = df_train$tax,
    indus = df_train$indus,
    dis = df_train$dis,
    rad = df_train$rad,
    black = df_train$black,
    rm = df_train$rm,
    lstat = df_train$lstat,
    zn = df_train$zn,
    ptratio = df_train$ptratio,
    medv = df_train$medv,
    chas = df_train$chas
)

I hope this makes my question reproducible. The objective is to have a list of formulas based on the forward method for best subset selection, adding the next most significant variable after each iteration.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Nov 18 '22 at 20:34
Done, let me know if there is more I can clarify in order to make it easier to resolve. — MVS, Nov 18 '22 at 20:43

score 0 · Accepted Answer · answered Nov 18 '22 at 21:22

Currently, you are not calling colnames properly. It is a function in base R, not an element of a data frame accessed with $. Even with that fixed, you need to convert the string values to formulas, such as with as.formula.
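To illustrate the difference, here is a minimal sketch on a small made-up data frame (`toy_df` and its values are hypothetical, not the data from the question). It calls `colnames()` on the data frame and then looks a column up by its name with `[[`.

```r
# Hypothetical toy data, used only to illustrate the lookup
toy_df <- data.frame(
  crim = c(0.1, 0.2, 0.3),
  age  = c(65, 78, 61),
  nox  = c(0.5, 0.4, 0.6)
)

# colnames() is a function applied to the data frame
second_name <- colnames(toy_df)[2]

# `$` treats what follows as a column name, so this looks for a
# column called "colnames", which does not exist here
missing_col <- toy_df$colnames

# `[[` accepts a character string, so it works with a computed name
second_col <- toy_df[[second_name]]
```

Here `missing_col` is NULL, while `second_col` holds the values of the `age` column.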

Also, consider using lapply to avoid the bookkeeping of initializing a list and then iteratively assigning elements by index. Use [-1] to drop the first column name (crim, the response).

list_formula <- lapply(
  colnames(ordered_df_train)[-1],
  function(col) as.formula(
    paste0("ordered_df_train$crim ~ ordered_df_train$", col)
  )
)

list_formula
# [[1]]
# ordered_df_train$crim ~ ordered_df_train$age
# <environment: 0x000002842a33f240>
#   
# [[2]]
# ordered_df_train$crim ~ ordered_df_train$nox
# <environment: 0x000002842a32c270>
#   
# [[3]]
# ordered_df_train$crim ~ ordered_df_train$tax
# <environment: 0x000002843931fd10>
#   
# [[4]]
# ordered_df_train$crim ~ ordered_df_train$indus
# <environment: 0x00000284365dc340>
#   
# [[5]]
# ordered_df_train$crim ~ ordered_df_train$dis
# <environment: 0x00000284379d9800>
#   
# [[6]]
# ordered_df_train$crim ~ ordered_df_train$rad
# <environment: 0x00000284379d7fb8>
#   
# [[7]]
# ordered_df_train$crim ~ ordered_df_train$black
# <environment: 0x00000284393cf6e0>
#   
# [[8]]
# ordered_df_train$crim ~ ordered_df_train$rm
# <environment: 0x00000284379ef078>
#   
# [[9]]
# ordered_df_train$crim ~ ordered_df_train$lstat
# <environment: 0x000002843959d320>
#   
# [[10]]
# ordered_df_train$crim ~ ordered_df_train$zn
# <environment: 0x000002843959bad8>
#   
# [[11]]
# ordered_df_train$crim ~ ordered_df_train$ptratio
# <environment: 0x00000284393e4ba8>
#   
# [[12]]
# ordered_df_train$crim ~ ordered_df_train$medv
# <environment: 0x00000284366e3348>
#   
# [[13]]
# ordered_df_train$crim ~ ordered_df_train$chas
# <environment: 0x00000284364db798>

Consider also reformulate, which builds the formula without as.formula + paste. The formulas below do not include the data frame qualifier, but you may be able to pass the data frame to the data argument of your modeling method.

list_formula <- lapply(
  colnames(ordered_df_train)[-1], function(col) reformulate(col, "crim")
)

list_formula
# [[1]]
# crim ~ age
# <environment: 0x000002843a203a18>
#   
# [[2]]
# crim ~ nox
# <environment: 0x000002843a20ad68>
#   
# [[3]]
# crim ~ tax
# <environment: 0x000002843a274678>
#   
# [[4]]
# crim ~ indus
# <environment: 0x000002843a279b18>
#   
# [[5]]
# crim ~ dis
# <environment: 0x000002843a282de8>
#   
# [[6]]
# crim ~ rad
# <environment: 0x000002843a286368>
#   
# [[7]]
# crim ~ black
# <environment: 0x000002843a2898e8>
#   
# [[8]]
# crim ~ rm
# <environment: 0x000002843a28ed88>
#   
# [[9]]
# crim ~ lstat
# <environment: 0x000002843a296138>
#   
# [[10]]
# crim ~ zn
# <environment: 0x000002843a2996b8>
#   
# [[11]]
# crim ~ ptratio
# <environment: 0x000002843a29eb58>
#   
# [[12]]
# crim ~ medv
# <environment: 0x000002843a2a5f08>
#   
# [[13]]
# crim ~ chas
# <environment: 0x000002843a2a9488>
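As a rough sketch of the data argument idea, the unqualified formulas can be handed to a modeling function together with the data frame. The example below uses `Boston` directly so that it is self-contained, and it relies on `glm` with its default gaussian family, which is an assumption: the question does not say which family is intended. The names `predictors` and `list_fit` are only illustrative.

```r
library(MASS)  # provides the Boston data set used in the question

# All columns except the response
predictors <- setdiff(colnames(Boston), "crim")

# One single-predictor formula per column, e.g. crim ~ zn
list_formula <- lapply(
  predictors,
  function(col) reformulate(col, response = "crim")
)

# Fit one model per formula; variable names are resolved through `data`
# (default gaussian family is an assumption, adjust as needed)
list_fit <- lapply(list_formula, function(f) glm(f, data = Boston))
names(list_fit) <- predictors

# Coefficients of the model that uses age as its predictor
coef(list_fit[["age"]])
```

Keeping the data frame out of the formula also makes the fitted models easier to reuse, for instance with predict on new data.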

That was it, although the intention was to have a formula of the fashion crim ~ order_of_variables[1:j] which I included in a for, as more functions were implemented under the loop. However, thank you very much for the explanation, the reformulate function was a lifesaver — MVS, Nov 22 '22 at 12:25
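For the cumulative form described in the comment above (crim ~ order_of_variables[1:j]), a minimal sketch with reformulate could look like this. The vector `ordered_vars` is a name introduced here for illustration; its order is copied from the ordered_df_train definition in the question.

```r
# Predictor order copied from the question's ordered_df_train
ordered_vars <- c("age", "nox", "tax", "indus", "dis", "rad", "black",
                  "rm", "lstat", "zn", "ptratio", "medv", "chas")

# Formula j uses the first j predictors:
# crim ~ age, crim ~ age + nox, crim ~ age + nox + tax, ...
list_formula <- lapply(
  seq_along(ordered_vars),
  function(j) reformulate(ordered_vars[seq_len(j)], response = "crim")
)

list_formula[[3]]
# crim ~ age + nox + tax
```

reformulate accepts a character vector of term labels and joins them with +, so no manual paste is needed for the growing right-hand side.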
