Creating edges for a riverplot

Question

I am hoping to use the riverplot package to create a flow diagram. This package needs 'edges' which are flows between levels. I want to create an edges data structure from a data frame. By way of example here is some code to create my input data.

rp.df<-structure(list(ID = 1:20, X1 = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A1", class = "factor"), 
X2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A2", 
"B2"), class = "factor"), X3 = structure(c(1L, 1L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
3L), .Label = c("A3", "B3", "C3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))
table(rp.df$X1,rp.df$X2)
table(rp.df$X2,rp.df$X3)

with this output

> table(rp.df$X1,rp.df$X2)

     A2 B2
  A1 12  8
> table(rp.df$X2,rp.df$X3)

     A3 B3 C3
  A2  2  2  8
  B2  5  2  1

what I need is a dataframe with the flows identified in the tables, eg:

N1 N2 Value
A1 A2    12
A1 B2     8
A2 A3     2
A2 B3     2
A2 C3     8
B2 A3     5
B2 B3     2
B2 C3     1

In reality I have 10 columns of edges and 16k in flows. I have tried using reshape2 to do this but struggled.

score 2 · Accepted Answer · answered Aug 04 '18 at 16:08

Here's a base R solution, generalized for however many columns you have.

out <- lapply(2:(ncol(rp.df) - 1), function(i) {
  as.data.frame(table(rp.df[, i], rp.df[, i + 1]))
  }
)
setNames(do.call(rbind, out), c("N1", "N2", "Value"))
#   N1 N2 Value
# 1 A1 A2    12
# 2 A1 B2     8
# 3 A2 A3     2
# 4 B2 A3     5
# 5 A2 B3     2
# 6 B2 B3     2
# 7 A2 C3     8
# 8 B2 C3     1

score 0 · Answer 2 · answered Aug 04 '18 at 16:04

Here is a tidyverse solution. select(rp.df, X1:X2) is for the first X column to the one before the last X column. select(rp.df, X2:X3) is for the second X column to the last X column. By doing this you can ensure each column combination is addressed. dat is the final output.

library(tidyverse)

dat <- map2_dfr(select(rp.df, X1:X2), 
                select(rp.df, X2:X3),
                ~as_data_frame(table(.x, .y))) %>%
  set_names(c("N1", "N2", "Value"))
dat
# # A tibble: 8 x 3
#   N1    N2    Value
#   <chr> <chr> <int>
# 1 A1    A2       12
# 2 A1    B2        8
# 3 A2    A3        2
# 4 B2    A3        5
# 5 A2    B3        2
# 6 B2    B3        2
# 7 A2    C3        8
# 8 B2    C3        1

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

For the sake of completeness, here are two data.table solutions.

The first one binds the node data into one large data object first and aggregates, finally. The second aggregates for each combination of columns and binds the totals, finally.

Bind node data then aggregate

library(data.table)
library(magrittr)
setDT(rp.df)
edges <- lapply(3:ncol(rp.df), 
       function(i) rp.df[, .SD, .SDcols = (i-1L):i]) %>% 
  rbindlist() %>% 
  .[, .(Value = .N), by = .(N1 = X1, N2 = X2 )]
edges

   N1 N2 Value
1: A1 A2    12
2: A1 B2     8
3: A2 A3     2
4: A2 B3     2
5: A2 C3     8
6: B2 A3     5
7: B2 B3     2
8: B2 C3     1

Aggregate node data then bind

nm <- names(rp.df) %>% stringr::str_subset("^X")
edges <- lapply(2:length(nm), 
                function(i) rp.df[, .N, by = c(nm[i-1], nm[i])]) %>%
  rbindlist() 
setnames(edges, c("N1", "N2", "Value"))
edges

   N1 N2 Value
1: A1 A2    12
2: A1 B2     8
3: A2 A3     2
4: A2 B3     2
5: A2 C3     8
6: B2 A3     5
7: B2 B3     2
8: B2 C3     1

Caveat

Note that both approaches are not fully equivalent in case some edges appear multiple times. (They are equivalent for the given sample dataset).

Let's assume the edge (A1, A2) appears in X1 and X2 and also in X2 and X3. The first approach will summarize this in one output row while the second approach will create two output rows. So, the second approach would need an additional aggregation step to produce the same result as the first approach.

Which approach is suitable has to be decided by the OP.

If it is required, the stage or level where an edge is occuring can be recorded as well:

nm <- names(rp.df) %>% stringr::str_subset("^X")
edges <- lapply(2:length(nm), 
                function(i) rp.df[, .N, by = c(nm[i-1], nm[i])]) %>%
  rbindlist(idcol = TRUE) 
setnames(edges, c("Level", "N1", "N2", "Value"))
edges

   Level N1 N2 Value
1:     1 A1 A2    12
2:     1 A1 B2     8
3:     2 A2 A3     2
4:     2 A2 B3     2
5:     2 A2 C3     8
6:     2 B2 A3     5
7:     2 B2 B3     2
8:     2 B2 C3     1

Thanks for the caveat. This does not apply in my case but might in others, so well worth mentioning. — Stephen Clark, Aug 06 '18 at 07:22

Creating edges for a riverplot

3 Answers3

Bind node data then aggregate

Aggregate node data then bind

Caveat