Sankey Diagram with multiple colums and weight column - using NetworkD3 package

Question

I am trying to make an interactive Sankey with the networkd3 package. I have a dataset with eight columns.

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

I am using the code below in this post starting with "This question comes up a lot...": https://stackoverflow.com/a/52237151/4389763

This is the code I have:

df <- df %>% select(year1,year2,year3,year4,year5,year6,year7) 

links <-
df %>%
mutate(row = row_number()) %>%
gather('column', 'source', -row) %>%
mutate(column = match(column, names(df))) %>%
group_by(row) %>%
arrange(column) %>%
mutate(target = lead(source)) %>%
ungroup() %>%
filter(!is.na(target))

links <-
links %>%
mutate(source = paste0(source, '_', column)) %>%
mutate(target = paste0(target, '_', column + 1)) %>%
select(source, target)

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
          Target = 'target', Value = 'value', NodeID = 'name')

But I don't know how to add the value of the flow. For example from DBC to DBBC occurs five times in year1 to year2. And DBBC to DBBC occurs three times from year2 to year3. With the code above I see every occurance as 1 and I would like to see the total value of a flow.

Like this example of a Sankey. Where you can see the total of for example group_A to group_C and not every occurance.

And is it possible to see the percentages in the mouse over? For example Year1 = DBC to Year2 = DBBC value is 5 out of 5 and percentage is 100%.

Can someone help me? Thank you.

also, you will have to explain what the weight means... as it is, you have one value for weight per row, but each row has multiple links... if you want that to give the "value" for each link, then you're missing a bunch of data — CJ Yetman, Sep 18 '18 at 08:42
Thanks for your reaction. I have changed the question and added an example. I hope you can help me. @CJYetman — SuGer, Sep 18 '18 at 08:59
Now explain in greater detail what you mean by “add the weight” and “add the percentage”. Weight of what? Percentage of what? What does it mean to “add” it? — CJ Yetman, Sep 18 '18 at 09:23
In your image, where do the groups come from? What do they mean? What is year1 to year2 100% of? — CJ Yetman, Sep 18 '18 at 09:59
The image is just an example. In the Sankey I have every flow is one client. In the example, if a client has the same year1 and year 2, the flow is aggregated. — SuGer, Sep 18 '18 at 10:06

score 0 · Answer 1 · answered Sep 18 '18 at 12:36

0

I have changed the code:

Instead of:

links$value <- 1

The new code:

links <- links %>% group_by(source, target) %>% tally()
names(links)[3] <- "value"

answered Sep 18 '18 at 12:36

SuGer

11
4

1

Please, if it's an update of your question, edit it, do not add it as an answer. – s__ Sep 18 '18 at 13:12

score 0 · Answer 2 · answered Sep 18 '18 at 12:57

The first part of your question--how to get a dataset of links (source and target columns) from a dataset that has multiple links/edges defined on each row across several columns--is sufficiently answered by the answer that you linked to (with the minor addition that you start with an extra column, clientcode, which does not contain link information, so it needs to be removed first).

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

library(dplyr)
library(tidyr)

links <-
  df %>%
  select(-clientcode) %>% 
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target)) %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

links

# # A tibble: 30 x 2
#    source target       
#    <chr>  <chr>        
#  1 DBC_2  DBBC_3       
#  2 DBC_2  DBBC_3       
#  3 DBC_2  DBBC_3       
#  4 DBC_2  DBBC_3       
#  5 DBC_2  DBBC_3       
#  6 DBBC_3 DBBC_4       
#  7 DBBC_3 DBBC_4       
#  8 DBBC_3 Not in care_4
#  9 DBBC_3 Not in care_4
# 10 DBBC_3 DBBC_4       
# # ... with 20 more rows

The second part of your question is essentially, with a dataset of individual links, how can I aggregate like links into one link with a value column indicating how many individual links were aggregated into that one link. That can be achieved by grouping the source and target columns and summarizing with the count of rows.

links %>% 
  group_by(source, target) %>% 
  summarise(value = n())

# # A tibble: 11 x 3
# # Groups:   source [?]
#    source        target        value
#    <chr>         <chr>         <int>
#  1 DBBC_3        DBBC_4            3
#  2 DBBC_3        Not in care_4     2
#  3 DBBC_4        DBC_5             1
#  4 DBBC_4        Not in care_5     2
#  5 DBC_2         DBBC_3            5
#  6 DBC_5         DBC_6             1
#  7 DBC_6         Not in care_7     1
#  8 Not in care_4 Not in care_5     2
#  9 Not in care_5 Not in care_6     4
# 10 Not in care_6 Not in care_7     4
# 11 Not in care_7 Not in care_8     5

Since you want to display the percentage, not the count, you can modify that slightly to calculate the percentage of all links in each year, and then use the unit = "%" parameter of sankeyNetwork so that it displays properly.

links <- 
  links %>% 
  group_by(source, target) %>% 
  summarise(value = n() / nrow(df) * 100)

links

# # A tibble: 11 x 3
# # Groups:   source [?]
#    source        target        value
#    <chr>         <chr>         <dbl>
#  1 DBBC_3        DBBC_4           60
#  2 DBBC_3        Not in care_4    40
#  3 DBBC_4        DBC_5            20
#  4 DBBC_4        Not in care_5    40
#  5 DBC_2         DBBC_3          100
#  6 DBC_5         DBC_6            20
#  7 DBC_6         Not in care_7    20
#  8 Not in care_4 Not in care_5    40
#  9 Not in care_5 Not in care_6    80
# 10 Not in care_6 Not in care_7    80
# 11 Not in care_7 Not in care_8   100

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name', 
              units = "%")

Sankey Diagram with multiple colums and weight column - using NetworkD3 package

2 Answers2