User-defined function to iterate through factor levels in a regression

Question

I am a beginner in R so I'm sorry if my question is basic and has been answered somewhere else but unfortunately I could not find the answer.

One of my predictor variables, nationality, has 8 levels. I want to create a user-defined function that loops through each level of nationality, using one level per regression. I created a list of the levels of nationality as follows:

mylist <- list("bangladeshian", "british", "filipino", "indian",
               "indonesian", "nigerian", "pakistani", "spanish")

Then I created a user-defined function:

f1 <- function(x) { 
  l <- summary(glm(smoke ~ I(nationality == mylist[x]),
                   data=df.subpop, family=binomial(link="probit")))
  print(l)
}

f1(2)

f1(2) gives this output:

Call:
glm(formula = smoke ~ I(nationality == mylist[x]), 
    family = binomial(link = "probit"), data = df.subpop)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.629  -0.629  -0.629  -0.629   1.853  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -0.9173     0.1659  -5.530 3.21e-08 ***
I(nationality == mylist[x])TRUE  -4.2935   376.7536  -0.011    0.991    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 73.809  on 78  degrees of freedom
Residual deviance: 73.416  on 77  degrees of freedom
AIC: 77.416

Number of Fisher Scoring iterations: 14

As you can see, the coefficient for nationality is labeled "I(nationality == mylist[x])TRUE", which is not very informative: the reader has to refer back to the call f1(2) and to mylist to work out which level the coefficient represents. I believe there should be a cleaner, more straightforward way to run a regression for each level without having to call f1() 8 times.

Do you want to have `nationality` as a dummy variable in one regression model, or do you want a separate regression for each `nationality`? – timm Dec 20 '21 at 16:45
I want to conduct subpopulation analysis. Not sure which of those is more appropriate actually. – activeR1234 Dec 20 '21 at 17:12
Can you post a `dput` of your data for a [mcve]? See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1422451) – Parfait Dec 20 '21 at 18:26

Parfait · Accepted Answer · 2021-12-20T20:57:38.247


Consider building the formula dynamically with as.formula or reformulate, so that the name of each level appears in the coefficient label:

nationality_levels <- levels(df.subpop$nationality)

f1 <- function(x) { 
  # Build the formula; either call below gives an equivalent formula
  # f <- as.formula(paste0("smoke ~ I(nationality == '", x, "')"))
  f <- reformulate(paste0("I(nationality == '", x, "')"), "smoke")

  # Return the model summary so that lapply() collects it
  summary(
    glm(f, data=df.subpop, family=binomial(link="probit"))
  )
}

reg_list <- lapply(nationality_levels, f1)
reg_list
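
As a possible follow-up, the list can be named by level and the indicator's coefficient row pulled from each fit into one table. This is only a sketch: the simulated data frame stands in for the asker's df.subpop (which was never posted), and the names fit_level, fit_list and coef_table are hypothetical helpers introduced here for illustration.

```r
# Simulated stand-in for df.subpop (hypothetical data, not the asker's)
set.seed(42)
df.subpop <- data.frame(
  nationality = factor(sample(c("british", "indian", "spanish"),
                              300, replace = TRUE)),
  smoke = rbinom(300, 1, 0.3)
)

nationality_levels <- levels(df.subpop$nationality)

# Fit one probit model per level, each on the full data frame
fit_level <- function(x) {
  f <- reformulate(paste0("I(nationality == '", x, "')"), response = "smoke")
  glm(f, data = df.subpop, family = binomial(link = "probit"))
}

# Name each list element after its level so the results are self-describing
fit_list <- setNames(lapply(nationality_levels, fit_level), nationality_levels)

# Collect the indicator's row (second coefficient) from each model summary
coef_table <- t(sapply(fit_list, function(m) coef(summary(m))[2, ]))
coef_table
```

Each row of coef_table is then labeled with the level it refers to, and a single model can be inspected with, for example, summary(fit_list[["indian"]]).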

edited Dec 20 '21 at 20:57

answered Dec 20 '21 at 16:49


Thank you for your response. I do not think this is taking each level separately, for two reasons. First, the coefficient on the level of interest is different when using this method vs. when doing lm<-summary(glm(smoke~I(nationality=="indian") ,data=df.subpop,family=binomial(link="probit"))) . Second, all other levels of "nationality" are included as coefficients in every one of the regressions run by reg_list. – activeR1234 Dec 20 '21 at 17:02
Oh! I see you want to run model on entire data frame but the `I()` on each specific nationality. Yes, this solution does not do that but splits df by nationality and runs exact same model on each subset. – Parfait Dec 20 '21 at 17:40
See revamped answer which should display as you intend. – Parfait Dec 20 '21 at 20:59
Exactly! This worked perfectly, Thank you! – activeR1234 Dec 21 '21 at 07:38
