1

I'm working out the output of a model generated with glm. The model output is stored in a nested tibble. I want to calculate confidence interval through the transformation from type = "link" to inverse-link (using $family$linkinv). However, I can't get it to work with dplyr::mutate in a nested tibble because the way of pulling the $family$linkinv is from the model object using model$family$linkinv(x), which seems not to work as intended in the nested format.

Background

This current question is based on a previous question (and chosen answer) I posted about testing the level of liking fruits by different predictors using a linear model. I conduct a research to figure out which fruit is more likeable: mango, banana, or apple. To this end, I go ahead and sample 100 people at random. I ask them to rate, on a scale of 1-5, the degree of liking each of the fruits.

While the previous question had to do with lm, here I'm trying to utilize quasibinomial glm. The issue is that I want to get confidence intervals but my method (glm %>% predict) outputs SE in "link space", therefore I have to go through a conversion process (detailed in this SO answer) to get what I want.

Data

library(tidyverse)
library(magrittr)

set.seed(123)

fruit_liking_df <-
  data.frame(
    id = 1:100,
    i_love_apple = sample(c(1:5), 100, replace = TRUE),
    i_love_banana = sample(c(1:5), 100, replace = TRUE),
    i_love_mango = sample(c(1:5), 100, replace = TRUE),
    age = sample(c(20:70), 100, replace = TRUE),
    is_male = sample(c(0, 1), 100, prob = c(0.2, 0.8), replace = TRUE),
    education_level = sample(c(1:4), 100, replace = TRUE),
    is_colorblinded = sample(c(0, 1), 100, replace = TRUE)
  )

> as_tibble(fruit_liking_df)

## # A tibble: 100 x 8
##       id i_love_apple i_love_banana i_love_mango   age is_male education_level is_colorblinded
##    <int>        <int>         <int>        <int> <int>   <dbl>           <int>           <dbl>
##  1     1            3             5            2    50       1               2               0
##  2     2            3             3            1    49       1               1               0
##  3     3            2             1            5    70       1               1               1
##  4     4            2             2            5    41       1               3               1
##  5     5            3             1            1    49       1               4               0
##  6     6            5             2            1    29       0               1               0
##  7     7            4             5            5    35       1               3               0
##  8     8            1             3            5    24       0               3               0
##  9     9            2             4            2    55       1               2               0
## 10    10            3             4            2    69       1               4               0
## # ... with 90 more rows

I want to test my data in a per cent scale, so first I transform it by subtracting 1 and then dividing by 4:

fruit_liking_df %<>%
  mutate_at(vars(starts_with("i_love_")), ~ subtract(., 1) %>% divide_by(., 4))

> as_tibble(fruit_liking_df)

## # A tibble: 100 x 8
##       id i_love_apple i_love_banana i_love_mango   age is_male education_level is_colorblinded
##    <int>        <dbl>         <dbl>        <dbl> <int>   <dbl>           <int>           <dbl>
##  1     1         0.5           1            0.25    50       1               2               0
##  2     2         0.5           0.5          0       49       1               1               0
##  3     3         0.25          0            1       70       1               1               1
##  4     4         0.25          0.25         1       41       1               3               1
##  5     5         0.5           0            0       49       1               4               0
##  6     6         1             0.25         0       29       0               1               0
##  7     7         0.75          1            1       35       1               3               0
##  8     8         0             0.5          1       24       0               3               0
##  9     9         0.25          0.75         0.25    55       1               2               0
## 10    10         0.5           0.75         0.25    69       1               4               0
## # ... with 90 more rows


Now I use a pipe to run the glm model for each fruit, get SE in link space, and convert SE to CI

## will be needed later
my_new_data_for_pred <- expand_grid(
  age = 45,
  is_male = .5,
  education_level = 2.5,
  is_colorblinded = 0.5
)

## will be needed later
critval <- 1.96

model_fits_grouped <-
  fruit_liking_df %>%
  pivot_longer(starts_with("i_love"), values_to = "fruit") %>%
  group_by(name) %>%
  tidyr::nest() %>%
  mutate(model_fit = map(
    data,
    ~ glm(
      data = .x,
      fruit ~ I(age - 45) +
        I((age - 45) ^ 2) +
        I(is_male - .5) +
        I(education_level - 2) +
        is_colorblinded,
      family = quasibinomial
    )
  )) %>%
  mutate(predicted_values = map(
    model_fit,
    ~ bind_cols(my_new_data_for_pred,
                as.data.frame(
                  predict(
                    newdata = my_new_data_for_pred,
                    .x,
                    type = "link",
                    interval = "confidence",
                    level = 0.95,
                    se.fit = T
                  )
                )) %>%
      rowwise() %>%
      mutate(
        estimate =  fit,
        lower_ci_link =  fit - critval * se.fit,
        upper_ci_link = fit + critval * se.fit
      )
  ))

> model_fits_grouped

## # A tibble: 3 x 4
## # Groups:   name [3]
##   name          data               model_fit predicted_values 
##   <chr>         <list>             <list>    <list>           
## 1 i_love_apple  <tibble [100 x 6]> <glm>     <tibble [1 x 10]>
## 2 i_love_banana <tibble [100 x 6]> <glm>     <tibble [1 x 10]>
## 3 i_love_mango  <tibble [100 x 6]> <glm>     <tibble [1 x 10]>

Unnesting the predicted_values gets:

> model_fits_grouped %>% unnest(predicted_values)

## # A tibble: 3 x 13
## # Groups:   name [3]
##   name          data              model_fit   age is_male education_level is_colorblinded     fit se.fit residual.scale estimate lower_ci_link upper_ci_link
##   <chr>         <list>            <list>    <dbl>   <dbl>           <dbl>           <dbl>   <dbl>  <dbl>          <dbl>    <dbl>         <dbl>         <dbl>
## 1 i_love_apple  <tibble [100 x 6~ <glm>        45     0.5             2.5             0.5  0.0843  0.261          0.709   0.0843        -0.427         0.595
## 2 i_love_banana <tibble [100 x 6~ <glm>        45     0.5             2.5             0.5 -0.0718  0.286          0.781  -0.0718        -0.633         0.489
## 3 i_love_mango  <tibble [100 x 6~ <glm>        45     0.5             2.5             0.5 -0.140   0.279          0.762  -0.140         -0.687         0.407

Here's the problem: Now I want to mutate two more columns within predicted_values for the inverse-link transformation for lower_ci_link and upper_ci_link, but this fails

model_fits_grouped <-
  fruit_liking_df %>%
  pivot_longer(starts_with("i_love"), values_to = "fruit") %>%
  group_by(name) %>%
  tidyr::nest() %>%
  mutate(model_fit = map(
    data,
    ~ glm(
      data = .x,
      fruit ~ I(age - 45) +
        I((age - 45) ^ 2) +
        I(is_male - .5) +
        I(education_level - 2) +
        is_colorblinded,
      family = quasibinomial
    )
  )) %>%
  mutate(predicted_values = map(
    model_fit,
    ~ bind_cols(my_new_data_for_pred,
                as.data.frame(
                  predict(
                    newdata = my_new_data_for_pred,
                    .x,
                    type = "link",
                    interval = "confidence",
                    level = 0.95,
                    se.fit = T
                  )
                )) %>%
      rowwise() %>%
      mutate(
        estimate =  fit,
        lower_ci_link =  fit - critval * se.fit,
        upper_ci_link = fit + critval * se.fit
      ) %>%
######################### this addition fails ###########################
      mutate(
        lower_ci_inverse_link = model_fit$family$linkinv(lower_ci_link),
        upper_ci_inverse_link = model_fit$family$linkinv(upper_ci_link)
      )
#########################################################################
  ))

And I get:

Error: Problem with mutate() input predicted_values. x Problem with mutate() input lower_ci_inverse_link. x attempt to apply non-function i Input lower_ci_inverse_link is model_fit$family$linkinv(lower_ci_link). i The error occurred in row

  1. i Input predicted_values is map(...). i The error occurred in row 1.

I assume that the problem is that I'm trying to mutate new columns within predicted_values, but using model_fit$family$linkinv(lower_ci_link) refers to model_fit which is in a higher level in the nested tibble.

Bottom line question

How can I get to mutate inverse-link columns within predicted_values using model_fit$family$linkinv(lower_ci_link) and model_fit$family$linkinv(upper_ci_link) to ultimately get (scroll all the way to the rightest two columns):

> model_fits_grouped %>% unnest(predicted_values)

## # A tibble: 3 x 15
## # Groups:   name [3]
##   name          data               model_fit   age is_male education_level is_colorblinded   fit se.fit residual.scale estimate lower_ci_link upper_ci_link lower_ci_inverse_link_*DEMO* upper_ci_inverse_link_*DEMO*
##   <chr>         <list>             <list>    <dbl>   <dbl>           <dbl>           <dbl> <dbl>  <dbl>          <dbl>    <dbl>         <dbl>         <dbl>                      <dbl>                      <dbl>
## 1 i_love_apple  <tibble [100 x 6]> <glm>        45     0.5             2.5             0.5 0.521 0.0632          0.349    0.521         0.397         0.645                      0.111                      0.111
## 2 i_love_banana <tibble [100 x 6]> <glm>        45     0.5             2.5             0.5 0.482 0.0701          0.387    0.482         0.345         0.620                      0.222                      0.222
## 3 i_love_mango  <tibble [100 x 6]> <glm>        45     0.5             2.5             0.5 0.465 0.0683          0.377    0.465         0.331         0.599                      0.333                      0.333

APPENDIX


DEMONSTRATION OF HOW I'M ABLE TO GET WHAT I WANT WITHOUT A PIPE OR DATAFRAME

The following method relies on assigning variables for several steps along the way. For the sake of demonstration, it shows how to run the model and get the $family$linkinv for just one fruit.

Data

As before, it's fruit_liking_df after doing the arithmetic transformation to decimals, hence:

> as_tibble(fruit_liking_df)

## # A tibble: 100 x 8
##       id i_love_apple i_love_banana i_love_mango   age is_male education_level  is_colorblinded
##    <int>        <dbl>         <dbl>        <dbl> <int>   <dbl>           <int>            <dbl>
##  1     1         0.5           1            0.25    50       1               2                0
##  2     2         0.5           0.5          0       49       1               1                0
##  3     3         0.25          0            1       70       1               1                1
##  4     4         0.25          0.25         1       41       1               3                1
##  5     5         0.5           0            0       49       1               4                0
##  6     6         1             0.25         0       29       0               1                0
##  7     7         0.75          1            1       35       1               3                0
##  8     8         0             0.5          1       24       0               3                0
##  9     9         0.25          0.75         0.25    55       1               2                0
## 10    10         0.5           0.75         0.25    69       1               4                0
## # ... with 90 more rows

Model

I'm going to focus only on the i_love_apple column data, and run glm on it.

my_model <-
  glm(
    i_love_apple ~ 
      I(age - 45) + 
      I((age - 45) ^ 2) + 
      I(is_male - 0.5)  + 
      I(education_level - 2) + 
      I(is_colorblinded - 0.5),
    family = quasibinomial,
    data = fruit_liking_df
  )

Prediction

Now I run predict() on my_model using prediction data from my_new_data_for_pred:

prediction_link_type <- 
  predict(object = my_model,
          newdata = my_new_data_for_pred,
          type = "link",   ## <------------ type = "link" is crucial to note
          interval = "confidence",
          level = 0.95,
          se.fit = TRUE)


> prediction_link_type

## $fit
##          1 
## 0.08427577 

## $se.fit
## [1] 0.2606326

## $residual.scale
## [1] 0.7090294

Now I convert from the SE measure I got in prediction_link_type to confidence interval (CI) by multiplying the SE by critval (which has been assigned with 1.96). I assign two separate vectors: one with the upper bound CI, and another with the lower bound CI:

lower_ci_link <- prediction_link_type$fit - (critval * prediction_link_type$se.fit)
upper_ci_link <- prediction_link_type$fit + (critval * prediction_link_type$se.fit)

Almost there! I got the CI values but they are in "link" space (because predict() used type = "link"). To convert the CI values back from "link", I use inverse-link function:

lower_ci_inverse_link <- my_model$family$linkinv(lower_ci_link)
upper_ci_inverse_link <- my_model$family$linkinv(upper_ci_link)

In Summary

Although this "vectors" method gets the job done, it is not what I'm looking for. Instead, I want to incorporate the conversion of "link -> SE -> CI -> inverselink" through the pipe introduced in the beginning of this question.

Emman
  • 3,695
  • 2
  • 20
  • 44
  • Would it helpt to add a rowwise() %>% before the mutate? This causes the operation to be evaluated row by row instead of for the whole column and therefore you have a value instead of the vector as input. – pieterbons Aug 24 '20 at 10:24
  • thanks, @pieterbons. I tried adding an additional `rowwise()` before the broken `mutate`, following your suggestion but it won't work :/ – Emman Aug 24 '20 at 10:29
  • each value `predicted_values` is a list of 3. `fit`, `se.fit` and `residual.scale`. All 3 of them are numeric vector. There is no `fit$family` present in the first place hence you are getting that an error. Can you show how you run this with an independent example instead of this dataframe? – Ronak Shah Aug 25 '20 at 03:53
  • thanks, @RonakShah! I've updated the question per your request. I hope this make it clearer. – Emman Aug 25 '20 at 06:59
  • I ended up revising the entire question because of wrong code,. – Emman Aug 25 '20 at 11:38
  • Your code still does not work and you still have brackets at wrong place. Check the closing bracket of `map(model_fit....`. – Ronak Shah Aug 25 '20 at 14:46
  • @RonakShah, sorry, it's frustrating. Yes, I omitted the `critval` and `my_new_data_for_pred` assignment. I corrected it now. However, the brackets seem to be ok. I tried this code on a different machine and it runs well. So it's only about the `$linkinv` issue now. – Emman Aug 25 '20 at 19:11

1 Answers1

1

To refer to the data passed in map you need to use .x. Try the below answer.

library(tidyverse)

result <- fruit_liking_df %>%
  pivot_longer(starts_with("i_love"), values_to = "fruit") %>%
  group_by(name) %>%
  tidyr::nest() %>%
  mutate(model_fit = map(
    data,
    ~ glm(
      data = .x,
      fruit ~ I(age - 45) +
        I((age - 45) ^ 2) +
        I(is_male - .5) +
        I(education_level - 2) +
        is_colorblinded,
      family = quasibinomial
    )
  )) %>%
  mutate(predicted_values = map(
    model_fit,
    ~ bind_cols(my_new_data_for_pred,
                as.data.frame(
                  predict(
                    newdata = my_new_data_for_pred,
                    .x,
                    type = "link",
                    interval = "confidence",
                    level = 0.95,
                    se.fit = T
                  )
                )) %>%
      rowwise() %>%
      mutate(
        estimate =  fit,
        lower_ci_link =  fit - critval * se.fit,
        upper_ci_link = fit + critval * se.fit,
        lower_ci_inverse_link = .x$family$linkinv(lower_ci_link),
        upper_ci_inverse_link = .x$family$linkinv(upper_ci_link)
    )))

result looks like :

result
# name          data               model_fit predicted_values 
#  <chr>         <list>             <list>    <list>           
#1 i_love_apple  <tibble [100 × 6]> <glm>     <tibble [1 × 12]>
#2 i_love_banana <tibble [100 × 6]> <glm>     <tibble [1 × 12]>
#3 i_love_mango  <tibble [100 × 6]> <glm>     <tibble [1 × 12]>

To get all the values as separate columns you can use unnest_wider :

result %>% unnest_wider(predicted_values)

#  name  data  model_fit   age is_male education_level is_colorblinded     fit se.fit
#  <chr> <lis> <list>    <dbl>   <dbl>           <dbl>           <dbl>   <dbl>  <dbl>
#1 i_lo… <tib… <glm>        45     0.5             2.5             0.5  0.0843  0.261
#2 i_lo… <tib… <glm>        45     0.5             2.5             0.5 -0.0718  0.286
#3 i_lo… <tib… <glm>        45     0.5             2.5             0.5 -0.140   0.279
# … with 6 more variables: residual.scale <dbl>, estimate <dbl>, lower_ci_link <dbl>,
#   upper_ci_link <dbl>, lower_ci_inverse_link <dbl>, upper_ci_inverse_link <dbl>
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Yes, it works! Thanks! But do you see why this works for you but didn't in my case? I fail to find the key difference. I've revised the code in the question so now it's running smoothly (except for the broken lines the question is about). – Emman Aug 25 '20 at 10:01
  • 1
    The key difference which I can immediately spot are 1) I don't know why you are doing `bind_cols(my_new_data_for_pred,...` I think it fails for me because of that at that line. 2) When you are doing `estimate = fit` or `lower_ci_link = fit - critval * se.fit` it doesn't know what `fit` or `se.fit` is. You probably mean to use `predicted_values$fit` and same for `se.fit`. In my answer I use `map_dbl` to extract those values and include it as separate column but you don't have that column so you cannot directly refer them like that. – Ronak Shah Aug 25 '20 at 10:09
  • Thanks. This is based on a solution I adopted from: https://stackoverflow.com/a/63502317/6105259. And somehow it _does_ know how to do `lower_ci_link = fit - critval * se.fit` without needing to specify `predicted_values$fit`. The code (now corrected) in the question in this post also uses the same principle and works for me without explicit reference to `predicted_values$fit`. – Emman Aug 25 '20 at 10:29
  • Sorry for the hassle, but I had to revise the entire question because there was an error of wrong parenthesis placement in my model which threw off the entire code. Would you please consider looking at the question with a fresh eye? The code has to run well now (fingers crossed). Regardless, the thing not working for me with your solution is that I need `lower_ci_inverse_link` and `upper_ci_inverse_link` to be _within_ `predicted_values`.The question now asks that directly. Thanks. – Emman Aug 25 '20 at 11:37
  • @Emman Check updated answer. Maybe that is what you were looking for. – Ronak Shah Aug 26 '20 at 11:27