0

I trained a binomial model using glm(Xtrain, ytrain, formula='cbind(Response, n - Response) ~ features', family='binomial'), where ytrain is a response matrix with columns of counts (yes), counts (no).

The test responses I've held out are in the same form of response matrix. However, the predict() function returns probabilities -- one for each row of the training data. I now want to use the ROCR or AUC package to generate AUC curves, but my prediction and observations are in different formats. Does anyone know how to do this?


OK. Adding an example. Forgive it being meaningless/rank deficient/small, I only want to illustrate my case.

plants <- c('Cactus', 'Tree', 'Cactus', 'Tree', 'Flower', 'Tree', 'Tree')
sun <- c('Full', 'Half', 'Half', 'Full', 'Full', 'Half', 'Full')
water <- c('N', 'Y', 'Y', 'N', 'Y', 'N', 'N')
died <- c(10, 10, 8, 2, 15, 20, 12)
didntdie <- c(2, 10, 8, 20, 10, 10, 10)
df <- data.frame(died, didntdie, plants, sun, water)
dftrain <- head(df, 5)
dftest <- tail(df, 2)
model <- glm("cbind(died, didntdie) ~ plants + sun + water", data=dftrain, family="binomial")

At this point, predict(model, dftest) returns the log-odds (giving a probability of death) for the final two sets of features in my dataframe. Now I wish to compute an AUC curve. My observations are in dftest[c('died','didntdie')]. My predictions are essentially probabilities. AUC, ROCR, etc expect both predictions and observations to be a list of bernoulli responses. I can't find documentation on how to use this response matrix instead. Any help appreciated.

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
user
  • 621
  • 1
  • 9
  • 21
  • 2
    You should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to make it easier to help you. What is `Xtrain`? The default `glm()` function doesn't have a `eqtn=` parameter. What are you comparing the probabilities to? Exactly how do you want to calculate the ROC with this data? – MrFlick Jun 08 '16 at 19:06

1 Answers1

3

For starters, you could expand the data frame to synthesize binary outcomes with counts that exploit the weight= argument to glm().

obs <- died + didntdie
df <- df[rep(1:length(obs), each= 2),] # one row for died and one for didn't
df$survived <- rep(c(0L,1L), times=length(obs)) # create binary outcome for survival
df$weight <- c(rbind(died, didntdie)) # assign weights
df

#     died didntdie plants  sun water survived weight
# 1     10        2 Cactus Full     N        0     10
# 1.1   10        2 Cactus Full     N        1      2
# 2     10       10   Tree Half     Y        0     10
# 2.1   10       10   Tree Half     Y        1     10
# 3      8        8 Cactus Half     Y        0      8
# 3.1    8        8 Cactus Half     Y        1      8
# 4      2       20   Tree Full     N        0      2
# 4.1    2       20   Tree Full     N        1     20
# 5     15       10 Flower Full     Y        0     15
# 5.1   15       10 Flower Full     Y        1     10
# 6     20       10   Tree Half     N        0     20
# 6.1   20       10   Tree Half     N        1     10
# 7     12       10   Tree Full     N        0     12
# 7.1   12       10   Tree Full     N        1     10

model <- glm(survived ~ plants + sun + water, data=df, family="binomial", weights = weight)

If you want to do the train/test split you will need to do another expansion, this time duplicating rows on weight. Otherwise your test set is not a random holdout, at least one randomized at the individual plant level, which may invalidate your results (depending on what you are trying to conclude).

Thus you would do something like

df <- df[rep(1:nrow(df), times = df$weight),]
model <- glm(survived ~ plants + sun + water, data=df, family="binomial") 
# note the model does not change

library(pROC)
auc(model$fitted.values, df$survived)
# Area under the curve: 0.5833

Note this is the in-sample AUC. You should use a randomized holdout (or better yet, cross-validation) to estimate out-of-sample AUC. Using the top N rows of the data.frame for the split is not a good idea unless the row order has already been randomized.

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • Ah! Thanks for the example of how to expand the df. That will do it, I think. (Though, too bad ROCR doesn't have a parameter allowing you to enter other forms of data.) – user Jun 08 '16 at 21:25
  • @Erin in this example I showed you how to calculate it using `pROC::auc`. `ROCR::prediction` is too slow if all you want is AUC. ROCR is doing some other cool stuff under the hood. You can also write your own AUC function, which is what I use when working in `data.table` (not applicable here). – C8H10N4O2 Jun 08 '16 at 21:28
  • cool, thanks for the extra info on pROC. Don't worry about advising my holdout choices -- I know how I'm doing that part. – user Jun 08 '16 at 21:33