I have a continuous variable with a significant proportion of unknowns. My advisor is asking me to put the percentage next to it in the column. This reprex mimics what I am trying to do.

library(tidyverse)
library(gtsummary)

trial %>%       # included with the gtsummary package
  select(trt, age, grade) %>%
  tbl_summary()

I am trying to have the percentage of unknowns listed next to unknown, ideally in parentheses. It would look like 11 (5.5%).

Some replies asked how the missing data appears in my dataset. Here is a reprex of that:

library(gtsummary)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'readr' was built under R version 4.0.3

df <-
  tibble::tribble(
    ~age, ~sex,     ~race,              ~weight,
    70,   "male",   "white",            50,
    57,   "female", "african-american", 87,
    64,   "male",   "white",            NA,
    46,   "male",   "white",            49,
    87,   "male",   "hispanic",         51
  )

df %>%
  select(age, sex, race, weight) %>%
  tbl_summary(type = list(age ~ "continuous", weight ~ "continuous"), missing = "ifany")
Daniel D. Sjoberg
Elliott Chinn
  • I'm not sure there are any missing values in the example data you provided so it's not very useful for testing. Maybe you want `tbl_summary(missing="ifany")`? Otherwise, how exactly are these "unknowns" coded in your data? – MrFlick Dec 22 '20 at 19:50
  • Per the table, ages are unknown for 11 of the subjects. I am assuming that means values were available for 189 subjects and 11 subjects had missing values, but I may be wrong? – Elliott Chinn Dec 22 '20 at 19:59
  • Ah, ok. Then yes. `missing="ifany"` is the default. If you have "unknown" values, those should be coded as NA values so R knows they are missing. It's unclear what your actual data looks like so I'm not sure what the problem is. – MrFlick Dec 22 '20 at 20:01
  • @MrFlick Updated reprex in the original post – Elliott Chinn Dec 22 '20 at 22:07
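
To illustrate the point in the comments about coding unknowns as NA, here is a minimal sketch. It assumes the raw data uses a sentinel code for "unknown"; the value -99 and the names `df_raw` and `df_clean` are hypothetical and do not come from the original post.

```r
library(dplyr)

# Hypothetical raw data: -99 is an assumed sentinel code for "unknown"
df_raw <- tibble::tibble(
  age    = c(70, 57, 64, 46, 87),
  weight = c(50, 87, -99, 49, 51)
)

# Recode the sentinel to NA so that R (and tbl_summary()) treats it as missing
df_clean <- df_raw %>%
  mutate(weight = na_if(weight, -99))

# Count the missing weights after recoding
sum(is.na(df_clean$weight))
```

Once the values are real NAs, `tbl_summary()` can count them in its unknown row.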

1 Answer

There are a few ways to report the missing rate. I'll illustrate a few below and you may pick the best solution for you.

  1. Categorical variables: I recommend you make the missing values explicit factor levels before passing the data frame to tbl_summary(). The NA values will no longer be missing, and will be counted like any other level of the variable.
  2. Continuous variables: Use the statistic= argument to report the rate of missingness.
  3. All variables: Use add_n() to report the rate of missingness.
library(gtsummary)

trial %>%      
  select(age, response, trt) %>%
  # making the NA value explicit level of factor with `forcats::fct_explicit_na()`
  dplyr::mutate(response = factor(response) %>% forcats::fct_explicit_na()) %>%
  tbl_summary(
    by = trt,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c("{N_nonmiss}/{N_obs} {p_nonmiss}%",
                                     "{median} ({p25}, {p75})")
  ) %>%
  add_n(statistic = "{n} / {N}")

EDIT: Adding another example after comments from the original poster.

library(gtsummary)

trial %>%      
  select(age, response, trt) %>%
  # making the NA value explicit level of factor with `forcats::fct_explicit_na()`
  dplyr::mutate(response = factor(response) %>% forcats::fct_explicit_na(na_level = "Unknown")) %>%
  tbl_summary(
    by = trt,
    type = all_continuous() ~ "continuous2",
    missing = "no",
    statistic = all_continuous() ~ c("{median} ({p25}, {p75})",
                                     "{N_miss} ({p_miss}%)")
  ) %>%
  # updating the Unknown label in the `.$table_body`
  modify_table_body(
    dplyr::mutate,
    label = ifelse(label == "N missing (% missing)",
                   "Unknown",
                   label)
  )

Daniel D. Sjoberg
  • The variable I have is a continuous variable (like age in the reprex), so turning it into a factor doesn't work, otherwise this would be perfect. – Elliott Chinn Dec 22 '20 at 20:12
  • Using your code, I was able to mock up what I want, below, only I want "N missing (% not missing)%" to read as "Unknown" `library(gtsummary) trial %>% select(age) %>% tbl_summary(missing = "no", type = all_continuous() ~ "continuous2", statistic = all_continuous() ~ c("{N_miss} ({p_nonmiss})%", "{median} ({p25}, {p75})") )` – Elliott Chinn Dec 22 '20 at 23:08
  • I added another example that I *think* is what you're looking for. – Daniel D. Sjoberg Dec 22 '20 at 23:40
  • That worked perfectly, but is there a way to prevent it from adding that to the rest of the continuous variables, specifically ones that don't have any missing values? – Elliott Chinn Dec 23 '20 at 01:31
  • Using the statistic argument, you can specify the stats to present for each variable – Daniel D. Sjoberg Dec 23 '20 at 01:43
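
As a rough sketch of that last suggestion, the statistic= argument can take a list with one formula per variable, so the missing-count row is requested only where it is wanted. The choice of `ttdeath` as the fully observed variable and the object name `tbl` are assumptions made for illustration, not part of the original answer.

```r
library(gtsummary)

tbl <-
  trial %>%
  dplyr::select(age, ttdeath) %>%
  tbl_summary(
    type = all_continuous() ~ "continuous2",
    missing = "no",
    statistic = list(
      # age has missing values, so it gets a second row with the missing count
      age ~ c("{median} ({p25}, {p75})", "{N_miss} ({p_miss}%)"),
      # ttdeath is assumed to have no missing values, so it gets only the median row
      ttdeath ~ "{median} ({p25}, {p75})"
    )
  )

tbl
```

The label of the missing-count row can then be renamed with modify_table_body() as in the answer's second example.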