0

I want to generate a Sankey Diagram from product data looking like this.

   id  begin_date   status  
   1   01.02.2020   a       
   1   10.02.2020   b       
   1   17.02.2020   c       
   2   02.02.2020   d       
   2   06.03.2020   b       
   2   17.04.2020   c    

For your experimentation:

pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'],[2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])

After looking at this explanation: Draw Sankey Diagram from dataframe I want to construct the "Source-Target-Value"-Dataframe looking like this. To improve understanding, I did not convert Source and Target to integers.

# with Source = previous status
# with Target = next status
# with Value = count of IDs that transition from Source to Target
Source  Target      Value      Link Color
     a       b          1      rgba(127, 194, 65, 0.2)
     b       c          2      rgba(127, 194, 65, 0.2)
     d       b          1      rgba(211, 211, 211, 0.5)

The problem lies in generating Source, Target, and Value. The Source and Target should be the status transition from a to b. The Value is the count of ids doing that transition.

What is the best way to do this?

EDIT: Using an online generator, the result would look like this: enter image description here

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Dustin
  • 483
  • 3
  • 13

1 Answers1

0

Found the answer!

# assuming df is sorted by begin_date
import pandas as pd
df = pd.read_csv(r"path")
dfs = []
unique_ids = df["id"].unique()
for uid in unique_ids:
    df_t = df[df["id"] == uid].copy()
    df_t["status_next"] = df_t["status"].shift(-1)
    df_t["status_append"] = df_t["status"] +  df_t["status_next"]
    df_t = df_t.groupby("status_append").agg(Value=("status_append","count")).reset_index()
    dfs.append(df_t)

df = pd.concat(dfs, ignore_index=True)
df = df.groupby("status_append").agg(Value=("Value","sum")).reset_index()

df["Source"] = df['status_append'].astype(str).str[0]
df["Target"] = df['status_append'].astype(str).str[1]
df = df.drop("status_append", axis=1)
df = df[["Source", "Target", "Value"]]

yields

Source  Target  Value
a            b      1
b            c      2
d            b      1
Dustin
  • 483
  • 3
  • 13