I have done regression analysis in R many times, but I can't get the hang of what is happening in this model.

I have income and demographic data for 500 people. I am trying to understand the effect of gender on following COVID protocols while controlling for age, income and education. My dependent variable (e.g. mask wearing) is a factor (0 for no mask, 1 for mask worn). Age is a numeric variable between 18 and 35, gender is a character variable (M and F), income has levels from 0 to 5, and education is also coded from 0 to 5 to represent different education levels.

Here is a reproducible example:

pand_data <- data.frame(
  Age = sample(25:30),
  Edu = sample(0:5),
  mask = sample((0:1), 6, replace = TRUE),
  gender = sample(c("m", "f"), 6, replace = TRUE),
  income = sample((1:5), 6, replace = TRUE))

glm(mask ~ gender + Age + income + Edu, data = pand_data, family = "binomial")

The output shows the intercept and then, instead of a single coefficient for Age, it shows Age18, Age19... Age35 as separate variables. The same happens for income (income0, income1,...income5) and for the education levels. I converted the variables to factors and ran the same code, but that didn't work either. My end goal is to calculate odds ratios. I have used the epiR package previously, but that doesn't work with this model either.

I have never faced this before, and I have tried tweaking many things in this code, including switching to an `lm` model, but I think I am missing something, so here I am. Apologies for the long post!

Kriti
  • Make sure your data is actually a numeric data type. It sounds like your data might currently be of character or factor class. This can happen if you don't import your data correctly or if there are non-numeric values in a column for some reason (like an odd missing-value label), or if you try to convert mixed-type data to a matrix. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. – MrFlick Jun 25 '21 at 03:56
  • If I copy/paste your example I don't get the same result. I see just one coefficient for Age. Are you sure this reproduces the problem? I note you don't have `data=pand_data` in your main question. Are you sure you are using the same data source? Did you verify column types using `str()`, or do they just look numeric when printed? Also, when using random samples for data, be sure to use `set.seed()` so we can generate the same random numbers for testing. – MrFlick Jun 25 '21 at 04:30
  • Yes, I am using the same data source. When I check the column types it shows all columns to be numeric and gender to be a character variable. That's why I'm finding this frustrating – Kriti Jun 25 '21 at 04:31
  • If you restart a new R session and copy/paste your example, do you still get the same result? It would be helpful to copy/paste the exact output you are seeing. You could also try sharing a dput of the glm object: `model <- glm(...); dput(model)`. That will make it very clear what's going on. – MrFlick Jun 25 '21 at 04:33
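
The checks suggested in the comments can be sketched roughly as below. This is only an illustration under one assumption that the thread does not confirm: that `Age` was stored as character (or factor) rather than numeric. The data are simulated and hypothetical, not the asker's 500 rows, and `fit` is a name chosen for this example.

```r
# Hypothetical data: Age is deliberately stored as character to mimic
# a column that was read in with the wrong type (an assumption).
set.seed(1)
pand_data <- data.frame(
  Age    = as.character(sample(18:35, 50, replace = TRUE)),
  Edu    = sample(0:5, 50, replace = TRUE),
  mask   = sample(0:1, 50, replace = TRUE),
  gender = sample(c("m", "f"), 50, replace = TRUE),
  income = sample(0:5, 50, replace = TRUE)
)

# Inspect the real column types; a chr or Factor column gets expanded
# by glm() into one dummy coefficient per level (Age18, Age19, ...).
str(pand_data)

# Go through as.character() so that factor level codes are not
# mistaken for the underlying values.
pand_data$Age <- as.numeric(as.character(pand_data$Age))

fit <- glm(mask ~ gender + Age + income + Edu,
           data = pand_data, family = binomial)

names(coef(fit))           # Age, income and Edu each appear once
exp(coef(fit))             # odds ratios
exp(confint.default(fit))  # Wald 95% intervals on the odds-ratio scale
```

If `str()` on the real data already reports every predictor as `num` or `int`, this conversion changes nothing and the cause lies elsewhere, for example in a different object being passed to `glm()`.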
