As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
- Adding a relevel argument (shown below)
- Adding a relevel argument within the custom function itself (also shown below)
- Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english'
but could also be switched to work for 'computer science'
too.