The dataset I am working with comes from the Current Population Survey on IPUMS and has 1,716,121 observations of 13 variables. I am trying to run cross-validation on this data and then plot the resulting AUC values.
The model is a logistic regression, and the dependent variable is binary (either 0 or 1). Whenever I run the code, I get this warning:
In bind_rows_(x, .id) : Vectorizing 'labelled' elements may not preserve their attributes.
I am not sure what this means.
I also get the errors:
Error in select(., .id, outcome, pred) : unused arguments (.id, outcome, pred)
and
Error in summarise_impl(.data, dots) : Evaluation error: object 'outcome' not found.
If someone could help me with this, it would be greatly appreciated!
My code is:
# Packages the functions below come from
library(dplyr)    # %>%, mutate, select, group_by, summarize
library(tidyr)    # unnest
library(purrr)    # map, map2
library(modelr)   # crossv_kfold
library(broom)    # augment
library(pROC)     # roc
library(ggplot2)  # ggplot, geom_histogram
mod1_formula <- formula("self_employ ~
    as.factor(educ_level) +
    as.factor(SEX) +
    as.factor(RACE) +
    as.factor(NCHILD)")
cps_data %>%
  crossv_kfold(k = 2) %>%
  mutate(model = purrr::map(train, ~ glm(mod1_formula, data = ., family = binomial))) -> trained.models
trained.models %>%
  unnest(pred = map2(model, test, ~ predict(.x, .y, type = "response"))) -> test.predictions
trained.models %>%
  unnest(fitted = map2(model, test, ~ augment(.x, newdata = .y)),
         pred = map2(model, test, ~ predict(.x, .y, type = "response"))) -> test.predictions
test.predictions %>% select(.id, outcome, pred)
test.predictions %>%
  group_by(.id) %>%
  summarize(auc = roc(outcome, .fitted)$auc) %>%
  select(auc)
gg <- ggplot(data = test.predictions, aes(x = auc))
gg <- gg + geom_histogram()
gg