Why am I losing categorical data in my regression summary?

Question

box <- read.csv("BlackBoxtrainApril22.csv")

#Change the 2 categorical variables into factors
box$SOUND <- as.factor(box$SOUND)
box$SWITCH <- as.factor(box$SWITCH)

#divide training and testing data
train <- box[1:12000,]
test <- box[12001:18048,]

library(nnet)

# Fit a multinomial logistic regression of SOUND on all other columns
multinom_model <- multinom(SOUND ~ ., data=box)
summary(multinom_model)

Here's some output from dput(head(box)) to see what the data looks like:

structure(list(ID = c(86623L, 57936L, 54301L, 2678L, 65827L, 22420L), INPUT1 = c(30L, 87L, 16L, 64L, 33L, 5L), INPUT2 = c(31L, 76L, 33L, 77L, 72L, 50L), INPUT3 = c(72L, 31L, 87L, 91L, 53L, 26L), INPUT4 = c(29L, 79L, 41L, 59L, 66L, 50L), SWITCH = c("Low", "Low", "Low", "Minimum", "High", "High"), SOUND = c("Gargle", "Tick", "Tick", "Beep", "Beep", "Gargle")), row.names = c(NA, 6L), class = "data.frame")

I'm trying to predict a categorical variable (SOUND) from a mix of numeric and categorical predictors, using the code above. When I call summary() on the model, one of the SWITCH categories and one of the SOUND categories is missing from the output. I think it has something to do with reference categories, but I'm not exactly sure.

Welcome to SO, AriMorrison! Realize that we have no idea what is in the data, so it's unlikely we can help at all. Please provide a sample of the data by posting the output from `dput(head(box))`; if there are a lot of columns, then perhaps `dput(box[1:10,1:5])` or some specific subset of rows and columns that well-represents the data. See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for good examples on asking questions in a reproducible way. — r2evans, Apr 29 '21 at 00:52
Thanks. It's my first time posting on SO, so apologies for that. Here's the output from dput(head(box)): structure(list(ID = c(86623L, 57936L, 54301L, 2678L, 65827L, 22420L), INPUT1 = c(30L, 87L, 16L, 64L, 33L, 5L), INPUT2 = c(31L, 76L, 33L, 77L, 72L, 50L), INPUT3 = c(72L, 31L, 87L, 91L, 53L, 26L), INPUT4 = c(29L, 79L, 41L, 59L, 66L, 50L), SWITCH = c("Low", "Low", "Low", "Minimum", "High", "High"), SOUND = c("Gargle", "Tick", "Tick", "Beep", "Beep", "Gargle")), row.names = c(NA, 6L), class = "data.frame") — Ari Morrison, Apr 29 '21 at 17:01
(Please [edit] your question and put it there, don't post it in a comment. Thanks!) — r2evans, Apr 29 '21 at 17:02

score 1 · Answer 1 · answered Apr 29 '21 at 00:49

You're right about the reference categories. When you include a categorical/factor variable as a predictor, R's default treatment contrasts drop one level of the factor (the first level, unless you change it), and that level serves as the reference category. The estimates for the levels that you do see in the output are relative to the level that was dropped. For example, if you have a factor with levels "red", "blue", and "green", and "red" is the reference category, then the model estimates for "blue" and "green" will be for "blue" vs "red" and "green" vs "red", respectively. The same idea applies to the response in multinom(): the first level of SOUND is the baseline outcome, and each row of coefficients compares one of the other levels against it.
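To make this concrete, here is a minimal sketch on made-up data. The `toy` data frame and its values are assumptions for illustration only; just the column names and level names are borrowed from the question. It shows which level R treats as the reference, how `relevel()` changes it, and which names end up in the `multinom()` coefficients.

```r
# Toy data shaped like the question's columns (all values are invented).
set.seed(1)
n <- 300
toy <- data.frame(
  INPUT1 = sample(1:100, n, replace = TRUE),
  SWITCH = factor(sample(c("High", "Low", "Minimum"), n, replace = TRUE)),
  SOUND  = factor(sample(c("Beep", "Gargle", "Tick"), n, replace = TRUE))
)

# The first level of each factor is the reference (alphabetical by default).
levels(toy$SWITCH)
levels(toy$SOUND)

# The design matrix has a dummy column for every SWITCH level except the reference.
design_cols <- colnames(model.matrix(~ INPUT1 + SWITCH, data = toy))
design_cols

# Pick a different reference level for the predictor with relevel().
toy$SWITCH <- relevel(toy$SWITCH, ref = "Low")
releveled_cols <- colnames(model.matrix(~ INPUT1 + SWITCH, data = toy))
releveled_cols

# multinom() also uses the first level of the response as the baseline outcome,
# so the coefficient matrix has one row per remaining SOUND level.
library(nnet)
fit <- multinom(SOUND ~ INPUT1 + SWITCH, data = toy, trace = FALSE)
coef_rows <- rownames(coef(fit))
coef_cols <- colnames(coef(fit))
coef_rows
coef_cols
```

Nothing is lost from the data: the "missing" SWITCH level and the "missing" SOUND level are the baselines that every reported coefficient is measured against. If a different baseline is easier to interpret, `relevel()` can be applied to the response factor in the same way before fitting.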

Alec B

this should be a FAQ ... anyone know how to find the appropriate duplicates easily? – Ben Bolker Apr 29 '21 at 18:20
@BenBolker I closed a similar question earlier today with [this dupe](https://stackoverflow.com/questions/41032858/lm-summary-not-display-all-factor-levels), but I seem to remember seeing a less technical answer somewhere else – Allan Cameron May 23 '22 at 21:36
