
Background

I am creating a Sankey Diagram in R and I am struggling with labeling the nodes.

As an example, I will reuse a dataset of 10 imaginary patients who are screened for COVID-19. At baseline, all patients are negative for COVID-19. After, let's say, 1 week, all patients are tested again: now 3 patients are positive, 6 are negative and 1 has an inconclusive result. Yet another week later, the 3 positive patients remain positive, 1 patient goes from negative to positive, and the others are negative.

data <- data.frame(patient = 1:10, 
                   baseline = rep("neg", 10), 
                   test1 = c(rep("pos",3), rep("neg", 6), "inconcl"), 
                   test2 = c( rep(NA, 3), "pos", rep("neg", 6) ))

Attempt

To create the Sankey diagram, I am using the ggsankey package:

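library(tidyverse)
#devtools::install_github("davidsjoberg/ggsankey")
library(ggsankey)  # provides make_long(), geom_sankey() and geom_sankey_label()

# reshape the wide data into the long format that geom_sankey() expects
df <- data %>%
  make_long(baseline, test1, test2)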

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node,
               fill = factor(node), label = node)) +
  geom_sankey() +
  geom_sankey_label(aes(fill = factor(node)), size = 3, color = "white") +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme(legend.position = "bottom", legend.title = element_blank())


Question

I would like to label the nodes with the number of patients that are present in each node (e.g., the first node would be labeled as 10, and the inconclusive node would be labeled as 1, and so on...).

How do you do this in R without hardcoding the values?

Parts of solution

To extract the numbers from the data, I thought the initial step should be something like:

data %>% count(baseline, test1, test2)
#  baseline   test1 test2 n
#1      neg inconcl   neg 1
#2      neg     neg   neg 5
#3      neg     neg   pos 1
#4      neg     pos  <NA> 3

I think that if I am able to include the proper values in an extra column of the long data df, I should be able to call label=variable_name from the aesthetics?
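As a rough sketch of that idea, the per-node counts can be derived from the long form of the data rather than from the wide `count()` output. The example below uses `tidyr::pivot_longer()` as a stand-in for `make_long()` so that it runs without ggsankey; the names `node_counts` and `count` are illustrative assumptions, not part of any package.

```r
library(dplyr)
library(tidyr)

data <- data.frame(patient = 1:10,
                   baseline = rep("neg", 10),
                   test1 = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2 = c(rep(NA, 3), "pos", rep("neg", 6)),
                   stringsAsFactors = FALSE)

# One row per patient and stage (x), holding the result at that stage (node).
# This mimics only the x/node columns that make_long() would produce.
node_counts <- data %>%
  pivot_longer(c(baseline, test1, test2),
               names_to = "x", values_to = "node") %>%
  filter(!is.na(node)) %>%          # patients without a result form no node
  count(x, node, name = "count")    # number of patients per stage and node

node_counts
```

A table like this has one row per node, so it could be joined back onto the long data by `x` and `node` to supply the label values.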

user213544

2 Answers


Try this:

library(ggplot2)
library(ggsankey)
library(dplyr)


# create a count data frame for each node

df_nr <- 
  df %>% 
  filter(!is.na(node)) %>% 
  group_by(x, node) %>% 
  summarise(count = n(), .groups = "drop")

# join the counts to the sankey data frame, matching on stage (x) and node

df <- 
  df %>% 
  left_join(df_nr, by = c("x", "node"))




ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node,
               fill = factor(node))) +
  geom_sankey() +
  geom_sankey_label(aes(label = node), size = 3, color = "white") +
  geom_sankey_text(aes(label = count), size = 3.5, vjust = -1.5, check_overlap = TRUE) +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme_minimal()+
  theme(legend.position = "bottom",
        legend.title = element_blank())

data

data <- data.frame(patient = 1:10, 
                   baseline = rep("neg", 10), 
                   test1 = c(rep("pos",3), rep("neg", 6), "inconcl"), 
                   test2 = c( rep(NA, 3), "pos", rep("neg", 6) ))
df <- data %>%
  make_long(baseline, test1, test2)

You can adjust the placement of the count text with vjust, or swap geom_sankey_text for geom_sankey_label if you want a bounding box around the count (I am not sure this works so well). I am also not sure whether geom_sankey_label recognises check_overlap, which is used here to avoid drawing the count text multiple times on top of itself.
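One way to sidestep the placement and overlap questions might be to fold the count into the node label itself, so that a single label layer is enough. The sketch below is an untested variation on the code above: `long` is a hypothetical stand-in for the `make_long()` output (built with tidyr so the example runs without ggsankey), and `node_label` is a made-up column name.

```r
library(dplyr)
library(tidyr)

data <- data.frame(patient = 1:10,
                   baseline = rep("neg", 10),
                   test1 = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2 = c(rep(NA, 3), "pos", rep("neg", 6)),
                   stringsAsFactors = FALSE)

# Stand-in for make_long(): one row per patient and stage
long <- data %>%
  pivot_longer(c(baseline, test1, test2),
               names_to = "x", values_to = "node")

# add_count() attaches the group size to every row, so no separate join is needed
labelled <- long %>%
  add_count(x, node, name = "count") %>%
  mutate(node_label = if_else(is.na(node), NA_character_,
                              paste0(node, " (n = ", count, ")")))

distinct(labelled, x, node, node_label)

# Assumed usage with the real sankey data frame (not run here):
# geom_sankey_label(aes(label = node_label), size = 3, color = "white")
```

The same `add_count()` and `mutate()` steps should apply to the data frame returned by `make_long()`, since it also has `x` and `node` columns.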

Created on 2021-04-20 by the reprex package (v2.0.0)

Peter
Hi Peter, can you please answer my question (https://stackoverflow.com/questions/68266728/in-r-how-to-display-value-on-the-links-paths-of-sankey-graph), which is also about a Sankey diagram? I have tried a lot. – Md. Sabbir Ahmed Jul 06 '21 at 10:14

I believe I have the answer: the cause is that the version of R has changed. It works with R 3.6.1 but not with R 4.3.1.