I want to run a series of linear regressions on subsets of a dataframe (subset by about 100 occupations). One of the explanatory variables is a factor. For some subsets (i.e. occupations), this factor has only one level, but it is very important that I include it wherever I can. When I split the data and map the regression, I get an error about contrasts. I know why this is happening, but is there a way to use explanatory categorical variables in a formula and drop them when they have only one level?
I've seen the post about debugging this error, but that does not account for the mapping component that I'm attempting.
library(tidyverse)
# Create reprex data
# Here, there are only male plumbers, which will cause a problem later
df <- tibble(wage = rnorm(10, 100, 15),
             occupation = c(rep("Plumber", 5),
                            rep("Electrician", 5)),
             hours = rnorm(10, 40, 5),
             sex = c(rep("Male", 5),
                     rep("Male", 2),
                     rep("Female", 3)))
glimpse(df)
#> Observations: 10
#> Variables: 4
#> $ wage <dbl> 107.69546, 117.79401, 102.75925, 108.66250, 100.716...
#> $ occupation <chr> "Plumber", "Plumber", "Plumber", "Plumber", "Plumbe...
#> $ hours <dbl> 51.73202, 37.13047, 38.20627, 41.00303, 39.14806, 3...
#> $ sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma...
# Split the df by occupation and run a regression to explain wages
df %>%
  split(.$occupation) %>%
  map(~lm(wage ~ hours + sex,
          data = .))
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
Created on 2019-08-23 by the reprex package (v0.3.0)
I know why the contrasts error is occurring (because there are only males in the plumbers split), but is there a way to wrap 'sex' in something so that it's used if it can be, and dropped if it cannot? Or is there some syntax other than split & map that I can use to do what I want?
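To make the goal concrete, this is roughly the behaviour I have in mind. It is only a sketch, not something I'm sure is the right approach: `fit_subset` and its arguments are names I made up, and I'm assuming that "only one level" can be detected by counting the distinct non-missing values of each predictor within a subset.

```r
library(tidyverse)

# Small example data with the same structure as the reprex: only male plumbers
df <- tibble(wage = rnorm(10, 100, 15),
             occupation = rep(c("Plumber", "Electrician"), each = 5),
             hours = rnorm(10, 40, 5),
             sex = c(rep("Male", 7), rep("Female", 3)))

# Hypothetical helper (my own name): build the formula per subset,
# keeping only predictors that take at least two distinct values there
fit_subset <- function(data, response = "wage",
                       predictors = c("hours", "sex")) {
  usable <- predictors[map_lgl(predictors,
                               ~ n_distinct(data[[.x]], na.rm = TRUE) > 1)]
  # Fall back to an intercept-only model if no predictor varies
  rhs <- if (length(usable) > 0) usable else "1"
  lm(reformulate(rhs, response = response), data = data)
}

fits <- df %>%
  split(.$occupation) %>%
  map(fit_subset)

# Inspect which terms ended up in each model
map(fits, ~ names(coef(.x)))
```

With this, the plumber model would be fit as wage ~ hours and the electrician model as wage ~ hours + sex, but I'd prefer something built into the formula or the split & map step if it exists.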
Thanks.