How can I run multiple stepwise linear regressions at once?

Question

I am trying to predict which variables impact lift, which is the sales rate for food goods on promotion. In my dataset, lift is my dependent variable and I have eight possible independent variables. Here are the first couple of rows of my dataset.

I need to do this analysis for 20 different products across 30 different stores. I want to know if it is possible to run 20 regressions on all of the products simultaneously in R. This way I would only have to run 30 regressions manually, one for each store, and I would get results for each store. I would like to use stepwise because this is what I am familiar with.

Here is the code I have written so far using only one regression at a time:

    # Keep a single store/product combination
    data0 <- subset(data0, Store == "Store 1")
    data0 <- subset(data0, Product == "Product 1")

    ######## Summary Stats
    head(data0)
    summary(data0)
    str(data0)

    ### Data Frame
    library(plm)  # provides pdata.frame()
    data0 <- pdata.frame(data0, index = c("Product", "Time"))
    data0 <- data.frame(data0)

    ### Stepwise
    step_qtr_1v <- lm(Lift ~ Depth
                      + Length
                      + Copromotion
                      + Category.Sales.On.Merch
                      + Quality.Support.Binary,
                      data = data0)
    summary(step_qtr_1v)

I am new to R so would appreciate simplicity. Thank you.

Can you add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to your post as well as your dataset (or use an example dataset; e.g., `iris`, `mtcars`) and any relevant code you've written so far? — jrcalabrese, Jan 02 '23 at 18:44
Please edit this question to make it minimal and reproducible - follow the links kindly provided by @jrcalabrese — jpsmith, Jan 02 '23 at 19:03
Instead of providing links to code as images, can you please include the code in a code chunk, so we can see it directly? — chrimaho, Jan 02 '23 at 21:17

score 0 · Accepted Answer · edited Jan 03 '23 at 14:49

It's really important to follow the guidelines when asking a question. Nonetheless, I've made a toy example with the iris dataset.

In order to run the same regressions multiple times over different parts of your dataset, you can use the lapply() function, which applies a function over a vector or list (in this case, the name of the species). The only thing you have to do is pass this to the subset argument in the lm() function:

    data("iris")

    species <- unique(iris$Species)
    species

Running species shows the levels of this variable:

[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

And running colnames(iris) tells us what variables to use:

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

The lapply function can be run thereafter like so:

    models <- lapply(species, function(x) {
      lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
         data = iris, subset = iris$Species == x)
    })

    lapply(models, summary)

The result:

[[1]]

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris, subset = iris$Species == x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38868 -0.07905  0.00632  0.10095  0.48238 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.86547    0.34331   2.521   0.0152 *
Petal.Width   0.46253    0.23410   1.976   0.0542 .
Sepal.Length  0.11606    0.10162   1.142   0.2594  
Sepal.Width  -0.02865    0.09334  -0.307   0.7602  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1657 on 46 degrees of freedom
Multiple R-squared:  0.1449,    Adjusted R-squared:  0.08914 
F-statistic: 2.598 on 3 and 46 DF,  p-value: 0.06356


[[2]]

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris, subset = iris$Species == x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.61706 -0.13086 -0.02966  0.09854  0.54311 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.16506    0.40032   0.412    0.682    
Petal.Width   1.36021    0.23569   5.771 6.37e-07 ***
Sepal.Length  0.43586    0.07938   5.491 1.67e-06 ***
Sepal.Width  -0.10685    0.14625  -0.731    0.469    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2319 on 46 degrees of freedom
Multiple R-squared:  0.7713,    Adjusted R-squared:  0.7564 
F-statistic: 51.72 on 3 and 46 DF,  p-value: 8.885e-15


[[3]]

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris, subset = iris$Species == x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7325 -0.1493  0.0516  0.1555  0.5866 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.46503    0.47686   0.975    0.335    
Petal.Width   0.21565    0.17410   1.239    0.222    
Sepal.Length  0.74297    0.07129  10.422 1.07e-13 ***
Sepal.Width  -0.08225    0.15999  -0.514    0.610    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2819 on 46 degrees of freedom
Multiple R-squared:  0.7551,    Adjusted R-squared:  0.7391 
F-statistic: 47.28 on 3 and 46 DF,  p-value: 4.257e-14

BTW, you are not performing any stepwise regression in your code. But the above example can be easily modified to do so.
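As a rough sketch of that modification, assuming base R's `step()` (AIC-based selection from the `stats` package) is an acceptable form of stepwise regression for your purposes: fit the full model for each group, then pass it to `step()`. The `iris` data stands in for your dataset, and `step_models` is just an illustrative name. Each subset is taken before calling `lm()` so that `step()` refits on the same rows.

```r
data("iris")

species <- levels(iris$Species)

step_models <- lapply(species, function(x) {
  # Subset first, so every refit inside step() uses the same rows
  d <- iris[iris$Species == x, ]

  # Full model with all candidate predictors
  full <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width,
             data = d)

  # Stepwise selection by AIC; trace = 0 suppresses the step-by-step log
  step(full, direction = "both", trace = 0)
})

# Label each model with the group it was fitted on
names(step_models) <- species

# Show which formula was retained for each group
lapply(step_models, formula)
```

With your data, the same pattern would loop over the unique values of `Product` (after subsetting to one store) and use your `Lift ~ ...` formula in place of the `iris` one.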

Hope this helps.

FYI it is not good practice to include the `>` indents in your code because people cannot easily copy and paste the code into R. I have also contextualized some of your answer because it wasn't always super clear what you meant in some parts. Feel free to remove my edits if you feel they weren't useful. — Shawn Hemelstrand, Jan 03 '23 at 14:50
@ShawnHemelstrand thank you for your edit. Now it's clearer. The only thing that I wouldn't remove is the "Hello, and welcome to SO". — Santiago Capobianco, Jan 03 '23 at 15:10
Weird. For some reason I keep editing that in and it won't save it. Not sure why. In any case you can edit it yourself if you feel it's more welcoming. — Shawn Hemelstrand, Jan 03 '23 at 15:22
Thanks, @SantiagoCapobianco and @ShawnHemelstrand. This was very helpful. One quick question. I am only getting 18 regressions for my results, when I have 20 products in my dataset. Could you think of any reason why this is happening? Also, are the regressions in the same order as the products in the dataset? — Sam, Jan 03 '23 at 16:15
@Sam Check the length of the vector that you are using in lapply(). If the answer fitted your needs, don't forget to accept it. — Santiago Capobianco, Jan 03 '23 at 16:48
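
On the count and ordering question raised in the comments, `lapply()` returns exactly one element per element of its input, in the same order, so the number of models equals the length of the vector you loop over. A minimal sketch of how one might check this, again using `iris` as a stand-in (`groups` is an illustrative name): name the list by group, then compare the counts.

```r
data("iris")

# unique() keeps the groups in order of first appearance in the data
groups <- as.character(unique(iris$Species))
length(groups)   # this is how many models lapply() will return

models <- lapply(groups, function(g) {
  lm(Petal.Length ~ Petal.Width, data = iris[iris$Species == g, ])
})

# Naming the list makes it obvious which model belongs to which group
names(models) <- groups
names(models)

# Rows per group; a group missing here will have no model
table(iris$Species)
```

If `length(groups)` is smaller than expected, the vector itself is the place to look, for example whether the data were subset before `unique()` was called.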
