NA value not excluded in lm() in R

Question

I have a data frame with Sex (Female=1, Male=0) and Race (white=1, non-white=0), among other columns. There are some missing values in both Sex and Race (both are factor variables). Below is a screenshot of the Sex variable distribution.

However, when I ran the linear regression, no missing values were dropped. Below is the regression output. As you can see, for some reason, coefficients for both 0 and 1 show up for Sex and race. Does that mean R takes "NA" as the baseline? How can I fix the code so that lm() only uses complete cases?

Why do you say that the NAs are not excluded? I do not see anything about NAs in the output. — G5W, Mar 24 '22 at 23:29
Can you please edit your question to include the code and results as text rather than as an image? — Ben Bolker, Mar 24 '22 at 23:30

score 1 · Accepted Answer · answered Mar 24 '22 at 23:32

I'm guessing that your "not available" data are coded as empty strings ("") rather than as NA values. lm() automatically drops only rows containing NA values; an empty string is treated as just another factor level. You could try

mydata$Sex[mydata$Sex == ""] <- NA

or

mydata$Sex <- factor(mydata$Sex, levels = c(0,1))

and try again ...
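
To illustrate the guess above, here is a minimal sketch on made-up data (the data frame `toy` and its values are hypothetical, not the asker's data). It assumes the missing entries are stored as "" and shows how the empty string then becomes the baseline level, so that both 0 and 1 get a coefficient, and how recoding to NA changes that.

```r
# Hypothetical toy data: missing Sex values are stored as "" rather than NA
set.seed(1)
toy <- data.frame(
  y   = rnorm(9),
  Sex = factor(c("0", "1", "", "1", "0", "", "0", "1", "1"))
)

levels(toy$Sex)                      # "" is a level and sorts first, so it is the baseline
fit_bad <- lm(y ~ Sex, data = toy)
names(coef(fit_bad))                 # "(Intercept)" "Sex0" "Sex1"

# Recode the empty strings as NA and drop the now-unused "" level
toy$Sex[toy$Sex == ""] <- NA
toy$Sex <- droplevels(toy$Sex)

fit_ok <- lm(y ~ Sex, data = toy)    # the default na.action (na.omit) drops the NA rows
names(coef(fit_ok))                  # "(Intercept)" "Sex1"
nobs(fit_ok)                         # 7 of the 9 rows are used
```

The droplevels() call is optional here, since lm() ignores unused levels when it builds the model frame, but it keeps summaries such as table(toy$Sex) tidy.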

VYago · Answer 2 · 2022-03-24T23:49:08.423

You can remove all the rows with NAs using complete.cases:

all_nodes_group_merged.adj = all_nodes_group_merged[complete.cases(all_nodes_group_merged), ]
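
One caveat, sketched below on a hypothetical data frame `df` (not the asker's data): complete.cases() only flags true NA values, so if missing entries are stored as empty strings they have to be recoded first.

```r
# Hypothetical data frame: one true NA and one empty string in Sex
df <- data.frame(
  y   = c(1.2, 0.7, 3.1, 2.4),
  Sex = c("0", NA, "", "1"),
  stringsAsFactors = FALSE
)

complete.cases(df)                    # FALSE only for the NA row
nrow(df[complete.cases(df), ])        # 3: the "" row is still there

# Treat empty strings as missing first, then filter
df$Sex[df$Sex %in% ""] <- NA          # %in% never returns NA, so the index is safe
df_cc <- df[complete.cases(df), ]
nrow(df_cc)                           # 2
```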

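By the way, I recommend converting the factor variables to numeric. Note that as.numeric() applied directly to a factor returns the internal level codes (1, 2, ...), not the 0/1 labels, so convert via as.character() first:

lm(formula = Life_Satisfaction_6bp ~ as.numeric(as.character(Sex)) + as.numeric(as.character(race_white)) + item_count, data = all_nodes_group_merged.adj)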

Factor variables are handled in a special way in regression; see: https://stackoverflow.com/a/30159530/11180223

Edit

You can also convert the factors to numeric columns and check whether the result makes sense:

all_nodes_group_merged.adj$Sex_num = as.numeric(levels(all_nodes_group_merged.adj$Sex))[all_nodes_group_merged.adj$Sex]
all_nodes_group_merged.adj$race_white_num = as.numeric(levels(all_nodes_group_merged.adj$race_white))[all_nodes_group_merged.adj$race_white]

lm(formula = Life_Satisfaction_6bp ~ Sex_num + race_white_num + item_count, data = all_nodes_group_merged.adj)
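
The as.numeric(levels(x))[x] idiom used above matters because calling as.numeric() directly on a factor gives the internal level codes rather than the labels. A small sketch on a hypothetical factor `f`:

```r
# Hypothetical 0/1 factor
f <- factor(c("0", "1", "1", "0"))

as.numeric(f)                  # 1 2 2 1: internal level codes
as.numeric(levels(f))[f]       # 0 1 1 0: the numbers the labels stand for
as.numeric(as.character(f))    # 0 1 1 0: same result, written more plainly
```

If a factor still has a "" level, both of the last two forms turn that level into NA, which lm() then drops by default.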
