
I am working with a race variable that takes the following values: 1 Black, 2 Hispanic, 3 Mixed Race (Non-Hispanic), 4 Non-Black / Non-Hispanic. I want to combine 3 and 4 into the base category and keep Black and Hispanic as separate categories. I tried to create 2 dummies (Black = 1 and Hispanic = 1). Two extra columns are created, but the values in them are FALSE and TRUE rather than 0 and 1. The code I used:

nlsy2$Hispanic <- nlsy2$Race==2
nlsy2$Black <- nlsy2$Race==1
nlsy2$Race [ nlsy2$Race == 0 ] <- 3
nlsy2$Race [ nlsy2$Race == 0 ] <- 4

Also when I run summary(nlsy2$Hispanic) R gives me this output:

   Mode   FALSE    TRUE    NA's 
logical    5594    1526       0 

Are the NA's problematic when running a glm? Also, if you have a better way to recode the race variable, it would be much appreciated! Thank you!

David Heckmann
bree
  • try `nlsy2$Hispanic <- (nlsy2$Race == 2) + 0` – Adam Quek Apr 24 '17 at 02:52
  • Also, please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Adam Quek Apr 24 '17 at 02:52
  • Try grouping the categories with the `levels` function in R; see http://stackoverflow.com/questions/9604001/grouping-2-levels-of-a-factor-in-r . Also, why do you need to convert to dummies for modelling instead of using `as.factor`? For NAs you can always include `na.action = na.exclude` in your code, and depending on the data you can consider imputing them with the `mice` package – Learner_seeker Apr 24 '17 at 02:56
  • @Adam Quek: Yes! Thank you the NA disappears for Hispanic :D – bree Apr 24 '17 at 04:59
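The conversion suggested in the comments can be sketched as follows. This is a minimal illustration on a made-up toy data frame (the real `nlsy2` is not shown in the question, so the values here are an assumption). Note that in the `summary()` output above the `NA's` count is 0, which means the column has no missing values; `sum(is.na(...))` is a direct way to check.

```r
# Hypothetical toy data standing in for nlsy2 (not the asker's real data).
nlsy2 <- data.frame(Race = c(1, 2, 3, 4, 2, 1, NA))

# A comparison yields a logical vector: TRUE/FALSE, and NA where Race is missing.
is_hispanic <- nlsy2$Race == 2

# Two equivalent ways to turn the logical vector into 0/1 values.
nlsy2$Hispanic <- as.integer(is_hispanic)   # integer 0/1
nlsy2$Black    <- (nlsy2$Race == 1) + 0     # numeric 0/1, as in the comment above

# Count missing values explicitly instead of reading them off summary().
n_missing <- sum(is.na(nlsy2$Hispanic))
print(nlsy2)
print(n_missing)
```

Either form keeps a missing `Race` as `NA` in the dummy, so the two columns stay consistent with the original variable.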

1 Answer


Does

nlsy2$Race[nlsy2$Race == 3 | nlsy2$Race == 4] <- 0
nlsy2$Race <- factor(nlsy2$Race)

not do the job? You will want this variable stored as a factor rather than as numeric when doing any modelling, because the values are categories and you don't want to risk them being interpreted as numbers.
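As a slightly more explicit variant of the same idea, here is a sketch that collapses codes 3 and 4, attaches readable labels, and fits a logistic regression. The toy data, the `Married` outcome, the `RaceF` column, and the label "Other" are all assumptions made for illustration; they are not from the question.

```r
# Hypothetical toy data; Married is an assumed 0/1 outcome.
nlsy2 <- data.frame(
  Race    = c(1, 2, 3, 4, 2, 1, 4, 3),
  Married = c(1, 0, 1, 1, 1, 0, 1, 0)
)

# Collapse codes 3 and 4 into a single code 0; other codes (and NA) pass through.
race_group <- ifelse(nlsy2$Race %in% c(3, 4), 0, nlsy2$Race)

# The first level of a factor is the reference (base) category in glm().
nlsy2$RaceF <- factor(race_group,
                      levels = c(0, 1, 2),
                      labels = c("Other", "Black", "Hispanic"))

# Check the recode before modelling.
print(table(nlsy2$RaceF))

# R builds the Black and Hispanic dummy columns from the factor automatically.
fit <- glm(Married ~ RaceF, data = nlsy2, family = binomial)
print(names(coef(fit)))
```

With the factor in the formula there is no need to create the dummies by hand: the model gets one coefficient for Black and one for Hispanic, each relative to the combined base category.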

shians
  • The code you posted worked. I have a general question about declaring race as a factor: if I run a logistic regression of marital status on race etc., how would it affect my coefficients/results? The reason I am asking is that I have more categorical variables: gender, degree, etc. Gender is binary, but degree has several categories (from no education to PhD). I had no problem running a logistic regression; I am just wondering how it could affect my results if I don't tell R that my categorical variables are factors. Sorry if this is a trivial question. – bree Apr 24 '17 at 03:17
  • Well, for a binary variable it probably won't matter, but for a categorical one it doesn't really make sense for Hispanic to be 2 and Black to be 1, since there's no reason to expect Hispanic to be twice as much "race" as Black. For education the model should be even more complicated, because the levels are ordered but not necessarily linear. For the coefficients, you should notice that if you use a categorical variable you get a coefficient for each of Black and Hispanic, but if it is numeric you get a single coefficient for "Race", which would also imply that Hispanics have twice as much "Race" as Blacks. – shians Apr 24 '17 at 04:30
  • Thanks, I get the example with race; that is why I created dummies for Black and Hispanic instead of having one variable "Race". Now I am thinking about how to solve the problem with education. Degree = 0 for someone with no education, degree = 1 for HS ... up to degree = 4 for PhD. The variable is ordered, and because I am running a logistic regression, I thought that it is more flexible and makes no assumption about how exactly the independent variable (degree) and the dependent variable (marital status) are related. Would you suggest creating dummies for each degree level to improve the model? – bree Apr 24 '17 at 05:07
  • It's been a while since I've done anything with ordinals, but `x <- factor(x, ordered = TRUE)` should do the trick as long as your original levels were in the right order; otherwise just specify them with the `levels = c(...)` argument of `factor`. – shians Apr 24 '17 at 07:57
  • It works like a charm to declare the categorical variables as factors. However, R has an issue with the variable race and I get: `Error in cor(nlsy2) : 'x' must be numeric`. I get that there is no correlation between two categorical variables, or between a categorical and a continuous one. How would you change the command to get correlations only for the continuous variables in the dataset (it has only 9 variables)? – bree Apr 24 '17 at 14:46
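One possible way to handle the last two points in the comments, sketched on made-up data: declare `Degree` as an ordered factor, then pass only the numeric columns to `cor()`. The column names `Age` and `Income` and all values are assumptions for illustration only.

```r
# Hypothetical toy data with a mix of numeric and categorical columns.
nlsy2 <- data.frame(
  Age    = c(25, 31, 40, 28, 35),
  Income = c(30, 45, 60, 38, 52),
  Race   = factor(c("Other", "Black", "Hispanic", "Other", "Black")),
  Degree = c(0, 1, 4, 2, 3)
)

# Declare Degree as ordered, with the levels given explicitly in their natural order.
nlsy2$Degree <- factor(nlsy2$Degree, levels = 0:4, ordered = TRUE)

# is.numeric() is FALSE for factors, so this selects only the continuous columns.
num_cols <- sapply(nlsy2, is.numeric)

# drop = FALSE keeps a data frame even if only one numeric column remains.
cor_mat <- cor(nlsy2[, num_cols, drop = FALSE], use = "pairwise.complete.obs")
print(cor_mat)
```

Selecting columns by type rather than by name means the same two lines keep working if more factor columns are added later.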