
I have a big dataset that I want to partition by the values of a particular variable (in my case, lifetime), and then run a logistic regression on each partition. Following the answer of @tchakravarty in "Fitting several regression models with dplyr", I wrote the following code:

lifetimemodels <- data %>%
  group_by(lifetime) %>%
  sample_frac(0.7) %>%
  do(lifeModel = glm(churn ~ ., x = TRUE, family = binomial(link = 'logit'), data = .))

My question now is how to use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not sampled), again grouped by lifetime.

Thanks a lot in advance!

morfara
  • Introduce a column `training = sample(c(T, F), size = n(), prob = c(0.3,0.7), replace = TRUE)`, then withhold those rows from `glm` where `training == TRUE`. – AlexR Jan 10 '17 at 19:39

1 Answer


You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate and map functions, to create list-columns that store the pieces of your workflow.

The test/training split you are looking for is provided by modelr, a package built to assist modelling within the purrr framework, specifically by its crossv_mc and crossv_kfold functions.

A toy example using mtcars (just to illustrate the framework).

library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

analysis <- mtcars %>%
  nest(-cyl) %>%
  unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  mutate(model = map(train, ~lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))

This:

  1. Takes mtcars.
  2. Nests it by cyl into a list-column called data.
  3. Splits each data element into training and test sets by mapping crossv_mc over it, then uses unnest to expose the train and test list-columns.
  4. Maps the lm model to each train and stores the result in model.
  5. Maps the predict function to model and train and stores the result in pred.
  6. Maps the rmse function to model and test and stores the result in error.
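With tidyr 1.0 or later, the `nest(-cyl)` and `unnest(map(...))` forms used in the chain no longer behave as they did in 2017. The following is a minimal sketch of the same workflow, assuming tidyr >= 1.0 together with current dplyr, purrr, and modelr. The `cv` column name is chosen here for illustration, and unlike step 5 this version predicts on the held-out test rows rather than on the training rows.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

set.seed(1)  # make the random split reproducible

analysis <- mtcars %>%
  # tidyr >= 1.0: name the new list-column explicitly
  nest(data = -cyl) %>%
  # one Monte Carlo split per group, stored in a temporary list-column
  mutate(cv = map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  # tidyr >= 1.0: unnest an existing column by name
  unnest(cv) %>%
  mutate(
    model = map(train, ~lm(mpg ~ wt, data = .x)),
    # train/test are modelr "resample" objects; convert before predicting
    pred  = map2(model, test, ~predict(.x, newdata = as.data.frame(.y))),
    error = map2_dbl(model, test, rmse)
  )
```

The result should have one row per value of cyl, with the fitted model, the test-set predictions, and the test-set RMSE side by side.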

There are probably users out there more familiar than me with the workflow, so please correct/elaborate.

Jake Kaupp
  • Thanks for the very nice answer! I am actually looking to get an AUC because I have a classification problem. Do you have any idea how to get the AUC instead of the RMSE? This is how I do it on a single dataset (using the ROCR package), but I don't know if I can adapt it easily to your solution: `p1 <- predict(model, newdata = testSet, type="response"); pr1 <- prediction(p1, testSet$dependentvariable); prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr"); auc1 <- performance(pr1, measure = "auc"); auc1 <- auc1@y.values[[1]]; auc1` – morfara Jan 11 '17 at 00:23
  • It'd require an `unnest(map(testSet, ~select(.x, "dependentvariable")))`, then repeating the `mutate(var = map2(col1, col2, fun))` pattern to apply the AUC measures. – Jake Kaupp Jan 11 '17 at 02:02
  • I am a little lost now. Unnest after the pred step of your solution? and for your "fun" argument can I type "auc"? – morfara Jan 11 '17 at 02:06
  • You'll need to extend the chain, or start from `analysis`, and apply your AUC measures. Look into how to use `tidyr` and `purrr` in this fashion [here](https://jennybc.github.io/purrr-tutorial/) and [here](https://blog.rstudio.org/2016/02/02/tidyr-0-4-0/). – Jake Kaupp Jan 11 '17 at 14:43
  • Thanks a lot for the references. I have gotten my code this far: `analysis <- data %>% nest(-lifetime) %>% unnest(map(data, ~crossv_mc(.x, 1, test = 0.3))) %>% mutate(model = map(train, ~glm(churn ~., x= TRUE, family=binomial(link='logit'), data = .x))) %>% mutate(pred = map2(model, test, predict, type="response")) %>% unnest(map(test, ~select(.x, dependentvariable)))` But when I try to unnest in order to select the dependent variable, I get the following error: "Error: no applicable method for 'select_' applied to an object of class "resample"". Any ideas? – morfara Jan 13 '17 at 06:33
  • @Jake Kaupp: I ran your workflow and got this error: "Error: `.x` must be a vector, not a function. Run `rlang::last_error()` to see where the error occurred. In addition: Warning message: All elements of `...` must be named. Did you want `data = c(mpg, disp, hp, drat, wt, qsec, vs, am, gear, carb)`?" Do you know what could have caused it? I know that the question and answers were posted in 2017 and the packages have been updated since, but I am lost. – hnguyen Jan 12 '21 at 17:31
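
For the AUC question raised in these comments, here is one possible sketch, not a definitive implementation. It assumes tidyr >= 1.0 syntax. The `churn_data` tibble is simulated purely for illustration, and `auc_rank` is a hypothetical base-R helper that computes the AUC from ranks (the Mann-Whitney formulation) instead of calling ROCR. The `select_` error reported above arises because the train and test columns hold modelr "resample" objects rather than data frames, so the sketch converts them with `as.data.frame` first.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

set.seed(42)

# Simulated stand-in for the real data; every value here is made up.
churn_data <- tibble(
  lifetime = sample(1:3, 600, replace = TRUE),
  usage    = rnorm(600),
  churn    = rbinom(600, 1, plogis(0.8 * usage))
)

# Hypothetical helper: rank-based (Mann-Whitney) AUC using base R only.
# Returns NA when the labels contain a single class.
auc_rank <- function(prob, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)
  (sum(rank(prob)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

analysis <- churn_data %>%
  nest(data = -lifetime) %>%
  mutate(cv = map(data, ~crossv_mc(.x, 1, test = 0.3))) %>%
  unnest(cv) %>%
  mutate(
    # resample objects -> plain data frames
    train_df = map(train, as.data.frame),
    test_df  = map(test, as.data.frame),
    model = map(train_df,
                ~glm(churn ~ ., family = binomial(link = "logit"), data = .x)),
    # predicted probabilities on the held-out 30% of each group
    pred  = map2(model, test_df, ~predict(.x, newdata = .y, type = "response")),
    auc   = map2_dbl(pred, test_df, ~auc_rank(.x, .y$churn))
  )
```

`select(analysis, lifetime, auc)` then gives one held-out AUC per lifetime group. If ROCR is preferred, the body of `auc_rank` could presumably be replaced with the `prediction()` / `performance(measure = "auc")` calls from the comment above.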