R run linear model by group in dataset

Question

My dataset looks like this

df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
                nps_score=c(floor(runif(455, min=0, max=10))),
                service_score=c(floor(runif(455, min=0, max=10))),
                food_score=c(floor(runif(455, min=0, max=10))),
                clean_score=c(floor(runif(455, min=0, max=10))))

I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.

I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot seem to adapt the code to my problem.

# Trying to run this by group and output the resulting coefficients per site in a separate df, with their significance levels.

library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))

Any help on this would be greatly appreciated

If you are looking for a base R solution, you could loop through all levels of `site`, store each model's results in its own data frame, and then combine those data frames in an appropriate way (e.g. by `rbind`). - deschen, Dec 10 '20 at 20:21
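The base R approach described in this comment could look roughly like the sketch below. It is only an illustration: the seed, the helper object `per_site`, the result name `results_base`, and the renamed columns are assumptions made for the example, not part of the original post.

```r
library(MASS)  # provides rlm()

set.seed(1)  # arbitrary seed, only so the simulated data is reproducible
df <- data.frame(site = c(rep("A", 95), rep("B", 110), rep("C", 250)),
                 nps_score = floor(runif(455, min = 0, max = 10)),
                 service_score = floor(runif(455, min = 0, max = 10)),
                 food_score = floor(runif(455, min = 0, max = 10)),
                 clean_score = floor(runif(455, min = 0, max = 10)))

# Split the data by site and fit one robust model per subset.
per_site <- lapply(split(df, df$site), function(d) {
  # Drop the (constant) site column so that `.` only covers the scores.
  fit <- rlm(nps_score ~ ., data = d[, names(d) != "site"])
  # summary.rlm stores a matrix of estimates, standard errors and t values.
  coefs <- as.data.frame(summary(fit)$coefficients)
  names(coefs) <- c("estimate", "std.error", "statistic")
  data.frame(site = as.character(d$site[1]), term = rownames(coefs),
             coefs, row.names = NULL)
})

# Stack the per-site data frames into a single result.
results_base <- do.call(rbind, per_site)
rownames(results_base) <- NULL
results_base
```

The result has one row per site and term, so the three sites and four coefficients give 12 rows.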

deschen · Answer 1 · 2020-12-10T20:16:29.227

library(tidyverse)
library(broom)
library(MASS)

# We first create a formula object.
# MASS is loaded last, so its select() masks dplyr's; call dplyr::select() explicitly.
predictors <- df %>% dplyr::select(-site, -nps_score) %>% names()
my_formula <- as.formula(paste("nps_score ~", paste(predictors, collapse = " + ")))

# Now we can group by site and use the formula object within the pipe.
results <- df %>%
  group_by(site) %>%
  do(tidy(rlm(my_formula, data = .)))

which gives:

# A tibble: 12 x 5
# Groups:   site [3]
   site  term          estimate std.error statistic
   <chr> <chr>            <dbl>     <dbl>     <dbl>
 1 A     (Intercept)     5.16      0.961      5.37 
 2 A     service_score  -0.0656    0.110     -0.596
 3 A     food_score     -0.0213    0.102     -0.209
 4 A     clean_score    -0.0588    0.110     -0.536
 5 B     (Intercept)     2.22      0.852      2.60 
 6 B     service_score   0.221     0.103      2.14 
 7 B     food_score      0.163     0.104      1.56 
 8 B     clean_score    -0.0383    0.0928    -0.413
 9 C     (Intercept)     5.47      0.609      8.97 
10 C     service_score  -0.0367    0.0721    -0.509
11 C     food_score     -0.0585    0.0724    -0.808
12 C     clean_score    -0.0922    0.0691    -1.33

Note: I'm not familiar with the rlm function or whether it provides p-values in the first place, but at least the tidy function doesn't return p-values for rlm models. If a simple linear regression suits your needs, you could replace rlm with lm, in which case a sixth column with p-values is added.
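As a minimal sketch of the lm variant mentioned in the note, assuming ordinary least squares is acceptable for this data: the seed and the name `results_lm` are made up for the example, and `group_modify()` is used here in place of `do()`, which is a substitution rather than the original code.

```r
library(dplyr)
library(broom)

set.seed(1)  # arbitrary seed, only so the simulated data is reproducible
df <- data.frame(site = c(rep("A", 95), rep("B", 110), rep("C", 250)),
                 nps_score = floor(runif(455, min = 0, max = 10)),
                 service_score = floor(runif(455, min = 0, max = 10)),
                 food_score = floor(runif(455, min = 0, max = 10)),
                 clean_score = floor(runif(455, min = 0, max = 10)))

# Fit an ordinary linear model per site; tidy() on an lm fit
# returns a p.value column alongside the estimates.
results_lm <- df %>%
  group_by(site) %>%
  group_modify(~ tidy(lm(nps_score ~ service_score + food_score + clean_score,
                         data = .x))) %>%
  ungroup()

results_lm
```

The `p.value` column can then be compared against whichever significance threshold you prefer, for example with `mutate(significant = p.value < 0.05)`.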
