
I want to transform/pivot the following dataframe to indicate the immediate data flow from source to target:

    l0      l1      l2      l3      sum
0   IN      TOTAL   <NA>    <NA>    1
1   <NA>    TOTAL   OUT_A   OUT_B   2
2   <NA>    TOTAL   <NA>    OUT_C   3

In the above example, a data flow is represented by, e.g., l0 to l1 in row 0. Likewise, l1 to l2 and l2 to l3 represent (direct) data flows in row 1, as does l1 to l3 in row 2.

Expectation:

    source  target  sum
0   IN      TOTAL   1
1   TOTAL   OUT_A   2
2   TOTAL   OUT_C   3
3   OUT_A   OUT_B   2

For reproducibility:

import pandas as pd
df = pd.DataFrame({
    "l0": ['IN', pd.NA, pd.NA],
    "l1": ['TOTAL','TOTAL','TOTAL'],
    "l2": [pd.NA,'OUT_A', pd.NA],
    "l3": [pd.NA,'OUT_B',"OUT_C"],
    "sum": [1,2,3]})

pd.DataFrame({
    "source": ["IN","TOTAL","TOTAL","OUT_A"],
    "target": ["TOTAL","OUT_A","OUT_C","OUT_B"],
    "sum": [1,2,3,2]
})
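
To spell out the pairing rule, a small illustration that prints each row's chain of non-NA values and the resulting direct flows (illustration only; itertools.pairwise needs Python 3.10+):

from itertools import pairwise

for _, row in df.iterrows():
    # the non-NA level values of a row form a chain; every pair of
    # consecutive values in that chain is one direct source -> target flow
    chain = row[['l0', 'l1', 'l2', 'l3']].dropna().tolist()
    print(chain, list(pairwise(chain)))

# ['IN', 'TOTAL'] [('IN', 'TOTAL')]
# ['TOTAL', 'OUT_A', 'OUT_B'] [('TOTAL', 'OUT_A'), ('OUT_A', 'OUT_B')]
# ['TOTAL', 'OUT_C'] [('TOTAL', 'OUT_C')]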
Thomas
  • Why doesn't it have the combination `TOTAL - OUT_B` (as it is at the same distance as `TOTAL - OUT_C`)? – RomanPerekhrest Apr 24 '23 at 12:33
  • I am only interested in the direct source->target relationship. Thanks for your comment, I've enhanced the description slightly to cover that. – Thomas Apr 24 '23 at 13:47

4 Answers


Based on itertools.pairwise (applied to the non-NA values of each row) and pd.concat:

from itertools import pairwise

# for each row, pair up the consecutive non-NA values, attach the row's sum
# (the index after set_index('sum')), and concatenate the per-row frames
df_ = (pd.concat(df.set_index('sum').apply(
    lambda x: pd.DataFrame([[*p, x.name] for p in pairwise(x.dropna())]),
    axis=1).values, ignore_index=True)
       .set_axis(['source', 'target', 'sum'], axis=1))

  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
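
The row order differs slightly from the expected output in the question (the OUT_A/OUT_B pair comes before TOTAL/OUT_C here), so the content can be verified with a sorted comparison, e.g. via pandas.testing (a quick sanity check, not part of the original answer):

from pandas.testing import assert_frame_equal

expected = pd.DataFrame({
    "source": ["IN", "TOTAL", "TOTAL", "OUT_A"],
    "target": ["TOTAL", "OUT_A", "OUT_C", "OUT_B"],
    "sum": [1, 2, 3, 2]})

# same pairs and sums, only the row order differs
assert_frame_equal(
    df_.sort_values(["source", "target"]).reset_index(drop=True),
    expected.sort_values(["source", "target"]).reset_index(drop=True))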
RomanPerekhrest

You can use DataFrame.melt, remove the missing values with DataFrame.dropna, apply a custom sliding-window function, and finally create a two-column DataFrame with the sum column appended:

#https://stackoverflow.com/a/6822773/2901002
def window(seq, n=2):
    return [tuple(seq[i: i + n]) for i in range(len(seq) - n + 1)]
   

s = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value']
       .apply(window)
       .explode()
       )

df1 = (pd.DataFrame(s.tolist(), columns=['source', 'target'])
         .assign(sum=s.index.get_level_values('sum')))
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
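
On its own, the window helper just emits the sliding pairs, e.g. for the values of the middle row (illustration only):

print(window(['TOTAL', 'OUT_A', 'OUT_B']))
# [('TOTAL', 'OUT_A'), ('OUT_A', 'OUT_B')]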

Another idea, using a list comprehension:

g = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value'])

n = 2

df1 = (pd.DataFrame([(*(k[i: i + n]), j) for (i, j), k in g for i in range(len(k) - n + 1)],
                    columns=['source', 'target','sum']))
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
jezrael

Apply a function that gets the pairs of non-null values in each row, explode these pairs to separate rows, expand each pair to separate columns, and combine the result with the sum column from the original dataframe.

def pairs(row):
    """Return a list of the consecutive non-null pairs in a dataframe row."""
    items = row.dropna().values
    return [items[i:i+2] for i in range(len(items)-1)]

cols = ["l0", "l1", "l2", "l3"]
result = (df[cols]
          .apply(pairs, axis=1)    
          .explode()               # series of source-target pairs
          .apply(pd.Series)        # expand each pair to two columns
          .rename(columns={0: "source", 1: "target"}) # name the resulting columns
          .assign(sum=df["sum"])   # add the sum column from original df
         )
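
Since explode keeps the originating row index, the two pairs coming from row 1 share the index label 1; if a plain 0..n-1 index is preferred, a reset can be appended (sketch):

# give the result a fresh RangeIndex instead of the repeated row labels
result = result.reset_index(drop=True)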
Stuart

Use stack to remove the NAs, then combine the successive values with a dictionary comprehension, groupby, zip, and concat:

# group the stacked (NA-free) values by the original row's sum, pair each
# value with its successor, then concatenate the per-group frames
out = (pd.concat({k: pd.DataFrame(zip(g, g.iloc[1:]), columns=['source', 'target'])
                 for k, g in df.set_index('sum').stack().groupby(level=0)},
                names=['sum'])
         .reset_index('sum')
         .sort_index(ignore_index=True, kind='stable') # optional
       )

Output:

   sum source target
0    1     IN  TOTAL
1    2  TOTAL  OUT_A
2    3  TOTAL  OUT_C
3    2  OUT_A  OUT_B
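
If the column layout of the expected output is preferred (sum last instead of first), the columns can simply be reordered afterwards (a minor sketch):

out = out[['source', 'target', 'sum']]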
mozway
  • Output is different in your solution, tested with `n = 100; df = pd.concat([df]*(n//3), ignore_index=True)` – jezrael Apr 24 '23 at 13:30
  • @Thomas have you tried this approach? If it doesn't give the expected output can you clarify why? – mozway Apr 24 '23 at 13:42