Multiple variable linear regression on election/census data & resulting errors

Question

I have this data:

library(tidyverse)

df <- tibble(
  "racecmb" = c("White", "White", "White", "White", "White", "White", 
            "White", "White", "Black", "White", "Mixed", 
            "Black", "White", "White", "White"),
  "age" = c(77,74,55,62,60,59,32,91,75,73,43,67,58,18,57),
  "income" = c("10 to under $20,000", "100 to under $150,000", 
           "75 to under $100,000",  "75 to under $100,000",
           "10 to under $20,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "Less than $10,000",
           "$150,000 or more", "30 to under $40,000",
           "50 to under $75,000"),
  "party" = c("Independent", "Independent", "Independent", "Democrat", 
          "Independent", "Republican", "Independent", 
          "Independent", "Democrat", "Republican", "Republican", 
          "Democrat", "Democrat", "Independent", "Independent"),
 "ideology" = c("Moderate", "Moderate", "Conservative", "Moderate", 
             "Moderate", "Very conservative", "Moderate", 
             "Conservative", 
             "Conservative", "Moderate", "Conservative", 
             "Very conservative", "Liberal", "Moderate", "Conservative")
             )

I have tried to run a multiple regression:

regression <- lm(party ~ income + ideology + age, data = df) %>%
   summary()

I get this error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
NA/NaN/Inf in 'y'

My goal is to explain the way some people vote, but I don't see how to effectively code the data for my model.

Any comments/suggestions are appreciated...

You seem to be trying to fit a linear model with a categorical response. That doesn't make much sense. Can you describe what you are trying to do? Also, make sure to share sample data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so it can be copy/pasted into R for testing. The spaces in your sample data make this hard. – MrFlick, Jul 10 '18 at 21:02
So should I be using `glm()` instead of `lm()`? I guess that would make more sense — papelr, Jul 10 '18 at 21:49

Adzil · Accepted Answer · 2018-07-12T00:56:41.277

2

To begin with, lm() is not suited to a categorical response: it expects a numeric outcome, which is why it fails on `party`. What you're looking for is either rpart(), which gives you output as categories or classes, or a multinomial logit/probit regression, which returns the probabilities of each outcome occurring given some conditions.

Packages to install: rpart and statisticalModeling
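As a rough illustration of the two options, here is a minimal sketch. It assumes the multinomial logit is fitted with multinom() from the nnet package (my choice, not a package named in this answer) and uses only `age` and `ideology` from the question's data as predictors. With just 15 rows the fits are not meaningful; the point is only the shape of the calls and of the output.

```r
library(nnet)   # multinom(); assumed here as one way to fit a multinomial logit
library(rpart)  # rpart() for classification trees

# Part of the question's data, with character columns stored as factors
df <- data.frame(
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57),
  ideology = c("Moderate", "Moderate", "Conservative", "Moderate",
               "Moderate", "Very conservative", "Moderate", "Conservative",
               "Conservative", "Moderate", "Conservative",
               "Very conservative", "Liberal", "Moderate", "Conservative"),
  party = c("Independent", "Independent", "Independent", "Democrat",
            "Independent", "Republican", "Independent", "Independent",
            "Democrat", "Republican", "Republican", "Democrat",
            "Democrat", "Independent", "Independent"),
  stringsAsFactors = TRUE
)

# Multinomial logit: one row of class probabilities per observation
fit_mn <- multinom(party ~ age + ideology, data = df, trace = FALSE)
probs <- predict(fit_mn, newdata = df, type = "probs")

# Classification tree: one predicted class per observation.
# minsplit is lowered only because the default (20) exceeds the 15 rows here.
fit_tree <- rpart(party ~ age + ideology, data = df, method = "class",
                  control = rpart.control(minsplit = 2))
classes <- predict(fit_tree, newdata = df, type = "class")
```

`probs` has one column per party, while `classes` is a factor of predicted parties, which matches the difference between the two approaches described above.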

If you did not have a categorical response variable, you could convert your categorical variables into dummy variables and then run your regression including your dummy variables (remember to leave one as baseline).

This can be quickly achieved using the fastDummies package:

Example: df <- dummy_cols(df, select_columns = "ideology")
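For comparison, here is a small sketch of the same idea in base R rather than fastDummies. It assumes `ideology` is stored as a factor and, purely for illustration, uses `age` as a numeric response. model.matrix() builds the dummy columns and drops the first level as the baseline; lm() applies the same coding automatically when given a factor.

```r
# Part of the question's data; ideology is stored as a factor
df <- data.frame(
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57),
  ideology = factor(c("Moderate", "Moderate", "Conservative", "Moderate",
                      "Moderate", "Very conservative", "Moderate",
                      "Conservative", "Conservative", "Moderate",
                      "Conservative", "Very conservative", "Liberal",
                      "Moderate", "Conservative"))
)

# Dummy columns for ideology; the first level ("Conservative") is the baseline
X <- model.matrix(~ ideology, data = df)
colnames(X)

# lm() does the same dummy coding internally for factor predictors
fit <- lm(age ~ ideology, data = df)
names(coef(fit))
```

Each coefficient on a dummy column is then read as the difference from the baseline level.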

If your sample size is considerable, then you may also want to consider interactions in your model between the dummied variables!

edited Jul 12 '18 at 00:56

answered Jul 11 '18 at 09:02

Interactions can be coded with the `:` in the formula, right? And thank you for the help! – papelr Jul 11 '18 at 15:28
Also, when I try the dummy conversion, it just gives me the error: `Error in stopifnot(is.null(select_columns) || is.character(select_columns), : object 'ideology' not found` - is this because the variable is a factor? – papelr Jul 11 '18 at 15:39
I seem to have made a mistake in my code. The column name is supposed to be in quotation marks! So it is supposed to be: `df <- dummy_cols(df, select_columns = "ideology")`. Factors should be fine to pass through the function. Interactions can be written as `:` or `*`. The difference is that `*` is shorthand for `X1 + X2 + X1:X2`. So if you only want certain interactions, using `:` is safer. – Adzil Jul 12 '18 at 00:52
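
The difference between `:` and `*` can be checked directly by asking R which terms a formula expands to. This is a small sketch; `y`, `x1`, `x2` and the helper `labels_of` are placeholder names, not variables from the question.

```r
# Hypothetical helper: list the terms a formula expands to
labels_of <- function(f) attr(terms(f), "term.labels")

star  <- labels_of(y ~ x1 * x2)          # main effects plus interaction
full  <- labels_of(y ~ x1 + x2 + x1:x2)  # the same model written out
colon <- labels_of(y ~ x1:x2)            # interaction term only

identical(star, full)
colon
```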
