
I want to transform/pivot the following dataframe to indicate the immediate data flow from source to target:

    l0      l1      l2      l3      sum
0   IN      TOTAL   <NA>    <NA>    1
1   <NA>    TOTAL   OUT_A   OUT_B   2
2   <NA>    TOTAL   <NA>    OUT_C   3

In the above example, a data flow is represented by, e.g., l0 to l1 in row 0. Likewise, l1 to l2 and l2 to l3 represent (direct) data flows in row 1, as does l1 to l3 in row 2.

Expectation:

    source  target  sum
0   IN      TOTAL   1
1   TOTAL   OUT_A   2
2   TOTAL   OUT_C   3
3   OUT_A   OUT_B   2

For reproducibility:

import pandas as pd
df = pd.DataFrame({
    "l0": ['IN', pd.NA, pd.NA],
    "l1": ['TOTAL','TOTAL','TOTAL'],
    "l2": [pd.NA,'OUT_A', pd.NA],
    "l3": [pd.NA,'OUT_B',"OUT_C"],
    "sum": [1,2,3]})

pd.DataFrame({
    "source": ["IN","TOTAL","TOTAL","OUT_A"],
    "target": ["TOTAL","OUT_A","OUT_C","OUT_B"],
    "sum": [1,2,3,2]
})
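
To spell out the pairing rule, a small illustration that prints each row's chain of non-NA values and the resulting direct flows (illustration only; itertools.pairwise needs Python 3.10+):

from itertools import pairwise

for _, row in df.iterrows():
    # the non-NA level values of a row form a chain; every pair of
    # consecutive values in that chain is one direct source -> target flow
    chain = row[['l0', 'l1', 'l2', 'l3']].dropna().tolist()
    print(chain, list(pairwise(chain)))

# ['IN', 'TOTAL'] [('IN', 'TOTAL')]
# ['TOTAL', 'OUT_A', 'OUT_B'] [('TOTAL', 'OUT_A'), ('OUT_A', 'OUT_B')]
# ['TOTAL', 'OUT_C'] [('TOTAL', 'OUT_C')]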
Thomas
  • Why doesn't it have the combination `TOTAL - OUT_B` (as it is at the same distance as `TOTAL - OUT_C`)? – RomanPerekhrest Apr 24 '23 at 12:33
  • I am only interested in the direct source->target relationship. Thanks for your comment, I've enhanced the description slightly to cover that. – Thomas Apr 24 '23 at 13:47

4 Answers


Based on itertools.pairwise (applied to the non-NA values of each row) and pd.concat:

from itertools import pairwise

# for each row, pair up the consecutive non-NA values, attach the row's sum
# (the index after set_index('sum')), and concatenate the per-row frames
df_ = (pd.concat(df.set_index('sum').apply(
    lambda x: pd.DataFrame([[*p, x.name] for p in pairwise(x.dropna())]),
    axis=1).values, ignore_index=True)
       .set_axis(['source', 'target', 'sum'], axis=1))

  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
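
The row order differs slightly from the expected output in the question (the OUT_A/OUT_B pair comes before TOTAL/OUT_C here), so the content can be verified with a sorted comparison, e.g. via pandas.testing (a quick sanity check, not part of the original answer):

from pandas.testing import assert_frame_equal

expected = pd.DataFrame({
    "source": ["IN", "TOTAL", "TOTAL", "OUT_A"],
    "target": ["TOTAL", "OUT_A", "OUT_C", "OUT_B"],
    "sum": [1, 2, 3, 2]})

# same pairs and sums, only the row order differs
assert_frame_equal(
    df_.sort_values(["source", "target"]).reset_index(drop=True),
    expected.sort_values(["source", "target"]).reset_index(drop=True))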
RomanPerekhrest

You can use DataFrame.melt, remove the missing values with DataFrame.dropna, apply a custom sliding-window function, and finally create a two-column DataFrame with the sum column appended:

#https://stackoverflow.com/a/6822773/2901002
def window(seq, n=2):
    return [tuple(seq[i: i + n]) for i in range(len(seq) - n + 1)]
   

s = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value']
       .apply(window)
       .explode()
       )

df1 = (pd.DataFrame(s.tolist(), columns=['source', 'target'])
         .assign(sum=s.index.get_level_values('sum')))
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
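
On its own, the window helper just emits the sliding pairs, e.g. for the values of the middle row (illustration only):

print(window(['TOTAL', 'OUT_A', 'OUT_B']))
# [('TOTAL', 'OUT_A'), ('OUT_A', 'OUT_B')]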

Another idea, using a list comprehension:

g = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value'])

n = 2

df1 = (pd.DataFrame([(*(k[i: i + n]), j) for (i, j), k in g for i in range(len(k) - n + 1)],
                    columns=['source', 'target','sum']))
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
jezrael

Apply a function that gets the pairs of non-null values in each row, explode these pairs to separate rows, expand each pair to separate columns, and combine the result with the sum column from the original dataframe.

def pairs(row):
    """Return a list of the consecutive non-null pairs in a dataframe row."""
    items = row.dropna().values
    return [items[i:i+2] for i in range(len(items)-1)]

cols = ["l0", "l1", "l2", "l3"]
result = (df[cols]
          .apply(pairs, axis=1)    
          .explode()               # series of source-target pairs
          .apply(pd.Series)        # expand each pair to two columns
          .rename(columns={0: "source", 1: "target"}) # name the resulting columns
          .assign(sum=df["sum"])   # add the sum column from original df
         )
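
Since explode keeps the originating row index, the two pairs coming from row 1 share the index label 1; if a plain 0..n-1 index is preferred, a reset can be appended (sketch):

# give the result a fresh RangeIndex instead of the repeated row labels
result = result.reset_index(drop=True)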
Stuart

Use stack to remove the NAs, then combine the successive values with a dictionary comprehension, groupby, zip, and concat:

# group the stacked (NA-free) values by the original row's sum, pair each
# value with its successor, then concatenate the per-group frames
out = (pd.concat({k: pd.DataFrame(zip(g, g.iloc[1:]), columns=['source', 'target'])
                 for k, g in df.set_index('sum').stack().groupby(level=0)},
                names=['sum'])
         .reset_index('sum')
         .sort_index(ignore_index=True, kind='stable') # optional
       )

Output:

   sum source target
0    1     IN  TOTAL
1    2  TOTAL  OUT_A
2    3  TOTAL  OUT_C
3    2  OUT_A  OUT_B
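
If the column layout of the expected output is preferred (sum last instead of first), the columns can simply be reordered afterwards (a minor sketch):

out = out[['source', 'target', 'sum']]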
mozway
  • Output is different in your solution, tested with `n = 100; df = pd.concat([df]*(n//3), ignore_index=True)` – jezrael Apr 24 '23 at 13:30
  • @Thomas have you tried this approach? If it doesn't give the expected output can you clarify why? – mozway Apr 24 '23 at 13:42