Background
I am creating a Sankey Diagram in R and I am struggling with labeling the nodes.
As an example, I will reuse a dataset of 10 imaginary patients who are screened for COVID-19. At baseline, all patients are negative for COVID-19. One week later, all patients are tested again: now 3 patients are positive, 6 are negative, and 1 has an inconclusive result. Another week later, the 3 positive patients remain positive, 1 patient goes from negative to positive, and the others are negative.
data <- data.frame(patient  = 1:10,
                   baseline = rep("neg", 10),
                   test1    = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2    = c(rep(NA, 3), "pos", rep("neg", 6)))
Attempt
To create the Sankey diagram, I am using the ggsankey package:
library(tidyverse)
# devtools::install_github("davidsjoberg/ggsankey")
library(ggsankey)  # provides make_long(), geom_sankey() and geom_sankey_label()

# Reshape to long format: one row per patient and stage
df <- data %>%
  make_long(baseline, test1, test2)

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node,
               fill = factor(node), label = node)) +
  geom_sankey() +
  geom_sankey_label(aes(fill = factor(node)), size = 3, color = "white") +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme(legend.position = "bottom", legend.title = element_blank())
Question
I would like to label the nodes with the number of patients that are present in each node (e.g., the first node would be labeled as 10, the inconclusive node would be labeled as 1, and so on).
How do you do this in R without hardcoding the values?
Parts of solution
To extract the numbers from the data, I thought the initial step should be something like:
data %>% count(baseline, test1, test2)
#   baseline   test1 test2 n
# 1      neg inconcl   neg 1
# 2      neg     neg   neg 5
# 3      neg     neg   pos 1
# 4      neg     pos  <NA> 3
I think that if I am able to include the proper values in an extra column of the long data df, I should be able to call label = variable_name from the aesthetics?
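The idea above (an extra count column in the long data that is then mapped to label) could be sketched as follows. This is only a minimal sketch, assuming ggsankey is installed from GitHub and that the long data has the columns x and node, as in the plotting call above; the column name n is my own choice, not something the package requires.

```r
library(dplyr)
library(ggplot2)
library(ggsankey)  # assumption: installed via devtools::install_github("davidsjoberg/ggsankey")

data <- data.frame(patient  = 1:10,
                   baseline = rep("neg", 10),
                   test1    = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2    = c(rep(NA, 3), "pos", rep("neg", 6)))

# Long format: one row per patient and stage, with columns x and node
df <- data %>%
  make_long(baseline, test1, test2) %>%
  # n (hypothetical column name) = number of rows, i.e. patients,
  # that share the same stage (x) and the same node
  add_count(x, node, name = "n")

# Same plot as before, but the label aesthetic now points at the count column
p <- ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node,
                    fill = factor(node), label = n)) +
  geom_sankey() +
  geom_sankey_label(size = 3, color = "white") +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme(legend.position = "bottom", legend.title = element_blank())

print(p)
```

To show both the result and the count, something like label = paste0(node, " (n = ", n, ")") should work in the same place. Note that the rows with NA in test2 are counted as their own group here; how those rows should be handled in the plot is a separate decision that this sketch does not address.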