
I have this data:

library(tidyverse)

df <- tibble(
  "racecmb" = c("White", "White", "White", "White", "White", "White", 
            "White", "White", "Black", "White", "Mixed", 
            "Black", "White", "White", "White"),
  "age" = c(77,74,55,62,60,59,32,91,75,73,43,67,58,18,57),
  "income" = c("10 to under $20,000", "100 to under $150,000", 
           "75 to under $100,000",  "75 to under $100,000",
           "10 to under $20,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "20 to under $30,000",
           "100 to under $150,000", "Less than $10,000",
           "$150,000 or more", " 30 to under $40,000",
           "50 to under $75,000"),
  "party" = c("Independent", "Independent", "Independent", "Democrat", 
          "Independent", "Republican", "Independent", 
          "Independent", "Democrat", "Republican", "Republican", 
          "Democrat", "Democrat", "Independent", "Independent"),
 "ideology" = c("Moderate", "Moderate", "Conservative", "Moderate", 
             "Moderate", "Very conservative", "Moderate", 
             "Conservative", 
             "Conservative", "Moderate", "Conservative", 
             "Very conservative", "Liberal", "Moderate", "Conservative")
             )

I tried to run a simple multiple regression:

regression <- lm(party ~ income + ideology + age, data = df) %>%
   summary()

I get this error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
NA/NaN/Inf in 'y'

My goal is to explain the way some people vote, but I don't see how to effectively code the data for my model.

Any comments/suggestions are appreciated...

Nate
papelr
  • You seem to be trying to fit a linear model with a categorical response. That doesn't make much sense. Can you describe what you are trying to do? Also, make sure to share sample data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so it can be copy/pasted into R for testing. The spaces in your sample data make this hard. – MrFlick Jul 10 '18 at 21:02
  • So should I be using `glm()` instead of `lm()`? I guess that would make more sense – papelr Jul 10 '18 at 21:49
  • Also made it reproducible – papelr Jul 10 '18 at 22:03

1 Answer


To begin with, lm() is not suited to a categorical response variable. What you're looking for is either rpart(), which predicts categories (classes), or a multinomial logit/probit regression, which returns the probability of each outcome occurring given the predictors.

Packages to install: rpart and statisticalModeling
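
As a rough illustration of both options, here is a minimal sketch using the party, ideology and age columns from the question (income is left out to keep it short). Two things here are my assumptions, not part of the answer: multinom() from the nnet package stands in for the multinomial logit, since no function is named above, and minsplit = 5 is an arbitrary setting so the tree is allowed to split on only 15 rows.

```r
library(rpart)  # classification trees
library(nnet)   # multinom(): assumed choice for the multinomial logit

# Columns copied from the question; the response is stored as a factor.
df <- data.frame(
  party = factor(c("Independent", "Independent", "Independent", "Democrat",
                   "Independent", "Republican", "Independent",
                   "Independent", "Democrat", "Republican", "Republican",
                   "Democrat", "Democrat", "Independent", "Independent")),
  ideology = factor(c("Moderate", "Moderate", "Conservative", "Moderate",
                      "Moderate", "Very conservative", "Moderate",
                      "Conservative", "Conservative", "Moderate",
                      "Conservative", "Very conservative", "Liberal",
                      "Moderate", "Conservative")),
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57)
)

# Option 1: classification tree, which returns a predicted class per row.
# minsplit = 5 is an assumption; the default of 20 exceeds the 15 rows here.
tree_fit <- rpart(party ~ ideology + age, data = df, method = "class",
                  control = rpart.control(minsplit = 5))
tree_class <- predict(tree_fit, type = "class")

# Option 2: multinomial logit, which returns one probability per party.
logit_fit <- multinom(party ~ ideology + age, data = df, trace = FALSE)
logit_prob <- predict(logit_fit, type = "probs")
```

With a sample this small the fitted values are not meaningful; the point is only the shape of the output: a factor of predicted classes from the tree, and a matrix with one probability column per party from the logit.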

If you did not have a categorical response variable, you could convert your categorical variables into dummy variables and then run your regression including your dummy variables (remember to leave one as baseline).

This can be quickly achieved using the fastDummies package:

Example: df <- dummy_cols(df, select_columns = "ideology")
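
To see what "leave one as baseline" means in practice, here is a small sketch that uses base R's model.matrix() instead of fastDummies (a swapped-in technique, chosen only because it needs no extra package). Using age as the numeric response is purely an assumption for illustration; it is not the model the question is after.

```r
# Columns copied from the question.
df <- data.frame(
  ideology = factor(c("Moderate", "Moderate", "Conservative", "Moderate",
                      "Moderate", "Very conservative", "Moderate",
                      "Conservative", "Conservative", "Moderate",
                      "Conservative", "Very conservative", "Liberal",
                      "Moderate", "Conservative")),
  age = c(77, 74, 55, 62, 60, 59, 32, 91, 75, 73, 43, 67, 58, 18, 57)
)

# Treatment (dummy) coding: one 0/1 column per level except the first,
# which becomes the baseline absorbed into the intercept.
X <- model.matrix(~ ideology, data = df)
dummy_names <- colnames(X)

# lm() applies the same coding automatically when a predictor is a factor,
# so the dummies do not have to be built by hand for a numeric response.
fit <- lm(age ~ ideology, data = df)
coef_names <- names(coef(fit))
```

The first factor level ("Conservative" here) has no column of its own; each remaining coefficient is read as a difference from that baseline.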

If your sample size is large enough, you may also want to include interactions between the dummy variables in your model!

Adzil
  • Interactions can be coded with the `:` in the formula, right? And thank you for the help! – papelr Jul 11 '18 at 15:28
  • Also, when I try the dummy conversion, it just gives me the error: `Error in stopifnot(is.null(select_columns) || is.character(select_columns), : object 'ideology' not found` - is this because the variable is a factor? – papelr Jul 11 '18 at 15:39
  • I seem to have made a mistake in my code. The column name is supposed to be in quotation marks! So it is supposed to be: `df <- dummy_cols(df, select_columns = "ideology")`. Factors should be fine to pass through the function. Interactions can be written as `:` or `*`. The difference is that `*` is shorthand for `X1 + X2 + X1:X2`. So if you only want certain interactions, using `:` is safer. – Adzil Jul 12 '18 at 00:52
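
The difference between `:` and `*` can be checked without fitting anything, by asking R how it expands each formula. A minimal sketch; y, x1 and x2 are placeholder names, not columns from the question:

```r
# terms() works on a bare formula, so no data is needed here.
labels_star <- attr(terms(y ~ x1 * x2), "term.labels")
labels_long <- attr(terms(y ~ x1 + x2 + x1:x2), "term.labels")
labels_only <- attr(terms(y ~ x1:x2), "term.labels")

# `*` expands to both main effects plus the interaction,
# while `:` alone contributes only the interaction term.
same_expansion <- identical(labels_star, labels_long)
```

Printing labels_star and labels_only side by side shows which terms each operator adds to the model.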