In all the examples I can find of Sankey/alluvial diagrams, the links come together at a node in such a way that the size of the node is the sum of all the links connecting to it. However, I would like to visualize a matching procedure in which 2 databases are matched into 3 new datasets (A: the data from dataset 1 that could not be matched; B: the data that could be matched between the 2 datasets; and C: the data from dataset 2 that could not be matched).

If I draw a very simple version of this in Paint, it looks something like this:

enter image description here

Is there a way to do this in R, Python, or D3.js? Preferably with the R package networkD3 or with ggplot, but any software is acceptable.

In my real data there will be multiple matching steps and more than 2 datasets, which is why I want to implement this in R, Python, or JS rather than make a one-off version in Adobe.

Edit

Here is a labelled version of the plot, in which 75 from each of A and B 'connect' together into D, so that A + B > C + D + E.

enter image description here

L Smeets

1 Answer

  • You mentioned Python in the question, hence this answer is in Python.
  • This is really all about data preparation; it reuses the approach from this answer: plotly sankey graph data formatting.
  • Create two data frames, one of 100 rows and a second of 150 rows. They overlap wherever their key values match.
  • Count the rows that are in only one data frame and the rows that overlap.
  • Prepare the node names and values. I don't see a way to fully meet your calculation. I've chosen to give each target node the correct value; the trade-off is that each source contributes only half of the matched flow.
  • Create the figure.
import pandas as pd
import plotly.graph_objects as go
import numpy as np

# two data sets with some overlap...
df1 = pd.DataFrame({"key":range(0,100)})
df2 = pd.DataFrame({"key":range(25, 175)})

# calculate overlap rows
o1 = df1["key"].isin(df2["key"]).value_counts()
o2 = df2["key"].isin(df1["key"]).value_counts()

# prep overlap rows into structure ready for sankey figure
df = pd.DataFrame([{"source":str(len(df1)), "target":"a" + str(o1[False]), "value":o1[False]},
              {"source":str(len(df1)), "target":"b" + str(o1[True]), "value":o1[True]/2},
              {"source":str(len(df2)), "target":"b" + str(o2[True]), "value":o2[True]/2},
              {"source":str(len(df2)), "target":"c" + str(o2[False]), "value":o2[False]},
             ])

# build sankey figure
nodes = np.unique(df[["source","target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))

fig = go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
        },
    )
)
fig.show()  # display explicitly, so this also works outside a notebook

enter image description here
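
The counting step can also be sketched with an outer merge, which labels each key as present in only the first frame, only the second, or both. This is a minimal sketch and not part of the original answer: it reuses the same `df1`/`df2` ranges, and `merged` and `group_sizes` are names introduced here purely for illustration.

```python
import pandas as pd

# Same illustrative ranges as in the answer's code.
df1 = pd.DataFrame({"key": range(0, 100)})
df2 = pd.DataFrame({"key": range(25, 175)})

# An outer merge with indicator=True adds a `_merge` column whose values are
# "left_only", "both", or "right_only" for each key.
merged = df1.merge(df2, on="key", how="outer", indicator=True)

# `group_sizes` is a hypothetical name: number of rows per group, as a dict.
group_sizes = merged["_merge"].value_counts().to_dict()

unmatched_1 = group_sizes["left_only"]   # only in dataset 1
matched = group_sizes["both"]            # matched between the two datasets
unmatched_2 = group_sizes["right_only"]  # only in dataset 2

print(unmatched_1, matched, unmatched_2)
```

With these ranges the three groups come out as 25, 75, and 75 rows, the same numbers used in the question's labelled plot. The merge keeps the matched rows as a single group, which may be easier to extend to several matching steps than pairwise `isin` checks.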

Rob Raymond
  • Sorry, but this does not answer the question. This is just a normal Sankey graph. What I need is for the orange node to be a match of the green and purple nodes. So in my example the first 2 nodes sum to 250, but the second set only to 175, because in the middle node the links 'connect' together instead of staying separate (as they do in your answer) – L Smeets Oct 24 '21 at 20:08
  • so what you diagrammed manually is not what you want? in your example first two nodes sum to 100 not 175. I can achieve anything on the values and node names if defined clearly – Rob Raymond Oct 24 '21 at 20:57
  • Sorry for the confusion, I should maybe have labelled them. My left 2 nodes sum to 250 (150 + 100) and the right 3 nodes sum to 175 (25 + 75 + 75), because the node in the right middle 'receives' 75 from each of 2 nodes, but instead of adding up to 150 they are combined into 75 (representing matched cases). – L Smeets Oct 24 '21 at 21:34
  • got it ... updated. It's a partial solution. given a value is on a flow I can't think of a way to get sources and targets to show values you want. also changed from random to ranges to be using same numbers as your example – Rob Raymond Oct 24 '21 at 22:05
  • Thanks for this partial solution!! This will indeed solve some use cases, but it does not easily extend to adding many extra nodes and (more importantly) it greatly reduces the visual appeal of the Sankey chart: even though the values next to the nodes no longer sum to 250, their sizes (heights) still suggest that they do (stacked, the nodes on the left are as tall as the stacked nodes on the right). – L Smeets Oct 25 '21 at 11:15
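
The size mismatch raised in the last comment can be checked numerically. This is a rough sketch, assuming (as the question states) that a node is drawn in proportion to the sum of the links attached to it. The link values are the ones produced by the answer's code; `links`, `left_sizes`, and `right_sizes` are names introduced here for illustration.

```python
import pandas as pd

# Link table as built in the answer: the matched flow of 75 is split in half,
# so each source sends 37.5 into the matched node.
links = pd.DataFrame(
    [
        {"source": "100", "target": "a25", "value": 25.0},
        {"source": "100", "target": "b75", "value": 37.5},
        {"source": "150", "target": "b75", "value": 37.5},
        {"source": "150", "target": "c75", "value": 75.0},
    ]
)

# Total flow leaving each left-hand node and entering each right-hand node.
left_sizes = links.groupby("source")["value"].sum()
right_sizes = links.groupby("target")["value"].sum()

print(left_sizes.to_dict())
print(right_sizes.to_dict())
print(left_sizes.sum(), right_sizes.sum())
```

Under that assumption, the left-hand nodes are sized 62.5 and 112.5 rather than 100 and 150, and both columns total 175, which is why the stacked heights match on both sides even though the labels do not.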