The dataset I am working with is taken from the Current Population Survey on IPUMS, and it has 1,716,121 observations of 13 variables. I am trying to run k-fold cross-validation on this data and then graph the resulting AUC values.

The model I am using is a logistic regression, and my dependent variable is binary (it takes a value of either 0 or 1). Whenever I run the code, I get the warning:

In bind_rows_(x, .id) : Vectorizing 'labelled' elements may not preserve their attributes.

I am not sure what this means.

I also get the errors:

Error in select(., .id, outcome, pred) : unused arguments (.id, outcome, pred)

and

Error in summarise_impl(.data, dots) : Evaluation error: object 'outcome' not found.

If someone could help me with this, it would be greatly appreciated!

My code is:

    mod1_formula <- formula("self_employ ~
      as.factor(educ_level) +
      as.factor(SEX) +
      as.factor(RACE) +
      as.factor(NCHILD)")

    cps_data %>%
      crossv_kfold(k = 2) %>%
      mutate(model = purrr::map(train, ~ glm(mod1_formula, data = ., family = binomial))) -> trained.models

    trained.models %>%
      unnest(pred = map2(model, test, ~ predict(.x, .y, type = "response"))) -> test.predictions

    trained.models %>%
      unnest(fitted = map2(model, test, ~ augment(.x, newdata = .y)),
             pred = map2(model, test, ~ predict(.x, .y, type = "response"))) -> test.predictions

    test.predictions %>% select(.id, outcome, pred)

    test.predictions %>%
      group_by(.id) %>%
      summarize(auc = roc(outcome, .fitted)$auc) %>%
      select(auc)

    gg <- ggplot(data = test.predictions, aes(x = auc))
    gg <- gg + geom_histogram()
    gg
Graham
Cindy Ni
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Dec 07 '18 at 19:36

1 Answer

Based on the warning message, I believe the problem comes from ipumsr's use of labelled values instead of R's factors. In particular, you probably need to convert those variables to factors before running your regressions, rather than putting as.factor in the formula. Note also that as.factor doesn't pick up the labels; use as_factor instead. There is more information in the value-labels vignette.
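As a rough illustration of the conversion step, here is a minimal sketch. It assumes the haven package is installed (ipumsr re-exports its as_factor). The data frame `cps_small` and its values are made up for illustration and are not real CPS data; only the column names `self_employ` and `SEX` come from your code.

```r
# Minimal sketch: convert a labelled column to a factor BEFORE modelling.
# `cps_small` is a hypothetical stand-in with invented values.
library(haven)  # labelled() and as_factor()

cps_small <- data.frame(self_employ = c(0, 1, 0, 1, 0, 1, 1, 0))

# Mimic how an IPUMS extract stores a coded variable: numeric codes plus value labels
cps_small$SEX <- labelled(
  c(1, 2, 1, 2, 2, 1, 1, 2),
  labels = c(Male = 1, Female = 2)
)

# Convert once, up front; as_factor() uses the value labels as factor levels
cps_small$SEX <- as_factor(cps_small$SEX)
levels(cps_small$SEX)

# The formula can then refer to the column directly, without as.factor()
fit <- glm(self_employ ~ SEX, data = cps_small, family = binomial)
summary(fit)
```

With the real extract you would apply the same conversion to each labelled predictor (educ_level, SEX, RACE, NCHILD) before calling crossv_kfold, and drop the as.factor calls from the formula.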

I appreciate that the IPUMS licensing restrictions make it hard for you to post a full reproducible example, which the community here expects. Posting your code is a good first step, but without the data we cannot fully reproduce the problem. You could subset a small number of rows, check whether you get the same error message, and post that data. Otherwise, it may be easier to post to the IPUMS forum (http://answers.popdata.org/), where the IPUMS staff can access your extract, so you may get help faster.
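One way to take such a subset is sketched below. Since I don't have your extract, the `cps_data` here is a simulated stand-in with invented values, and the sample size of 200 is an arbitrary choice; with your real data you would skip the simulation step and sample from your own `cps_data`.

```r
# Minimal sketch: draw a small random subset of rows for a reproducible example.
# This `cps_data` is SIMULATED for illustration only; it is not real CPS data.
set.seed(42)
cps_data <- data.frame(
  self_employ = rbinom(5000, 1, 0.1),
  educ_level  = sample(1:4, 5000, replace = TRUE),
  SEX         = sample(1:2, 5000, replace = TRUE)
)

# Sample 200 row indices without replacement and keep only those rows
cps_sample <- cps_data[sample(nrow(cps_data), 200), ]

# Rerun the failing code on cps_sample; if the same message appears,
# dput() prints the rows in a form that can be pasted into a post
dput(head(cps_sample, 20))
```

Check what your IPUMS terms allow before posting any rows from the actual extract.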

GregF