Pandas Dataframe - running through two columns 'Father' and 'Son' to rebuild end-to-end links step by step

Question

I have a long dataframe I need to transform to get a wide one. The long one is :

df = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'J', 'J', 'J', 'J'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J']
})
df

First thing to do, I think, is to group it by key. Then we have to find where those keys are found into the column 'son', it's the end (and last son) of the link I need to rebuild.

To rebuild the link, I need to look for his 'father'. His 'father' needs to be kept as father of final step and, also needs to be found into 'son'.

I need to iterate this until a 'father' cannot be found into the 'son' column, so it's going to be the father_0 of the link.

I think it could be done iterating those steps into a recursive function where the stop case : is 'father' not found in 'son'.

Here is the dataframe I want to get from this :

df1 = pd.DataFrame({
    'key' : ['E', 'J'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
})
df1

I simplified the problem here with 2 different links of the same depth, but they could be from depth 1 to depth 10 (maybe more but rarely and unpredictably) for a lot of different keys. Here is another example of df with 2 links of different size :

df_ = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I', 'J'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J', 'K']
})
df_

df_1 = pd.DataFrame({
    'key' : ['E', 'K'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
    'father_5' : [np.NaN, 'J'],
    'son_5' : [np.NaN, 'K']
})
df_1

Then the final step is easy, it's about taking 'father_x' and 'son_x-1' into 'step_x-1' : So the resulting dataframes for these examples would be :

df2 = pd.DataFrame({
    'key' : ['E', 'J'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
})
df2

df_2 = pd.DataFrame({
    'key' : ['E', 'K'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
    'step_5' : [np.NaN, 'K']
})
df_2

My concerne is more about the way to aggregate the data from wide to long following the previously given rules into an recursive function.

It's like in a groupby.agg but that I can't just pass a dictionnary into it because the new columns are based on the number of iteration of the recursive function on each key.

Yes, I already thought about it, but I always have a problem drawing graphs. I always have the error `networkx random_state index is incorrect` even with last version of pandas and networkx. Maybe I will also make a question about it, but I would like to use this library to do this. — Lemisourd, Jan 14 '22 at 16:22
The edit I made is more likely to need networkx, because the examples are not ordered smoothly as I presented them first but ordered randomly. — Lemisourd, Jan 14 '22 at 16:46
I finally tested networkx but I don't understand how to rebuild the links and then turn it back to a wide dataframe (df -> networkx.MultiDiGraph -> df gave me output = input). I'm going to make it only with pandas and post the answer while getting it. However your suggestion was great. — Lemisourd, Jan 19 '22 at 14:39

score 1 · Answer 1 · answered Jan 14 '22 at 16:23

1

Assign the new key with cumcount then we can do pivot

out = df.assign(c = df.groupby('key').cumcount().add(1).astype(str)).pivot('key','c').sort_index(level=1,axis=1)
out.columns = out.columns.map('_'.join)
out
Out[34]: 
    father_1 son_1 father_2 son_2 father_3 son_3 father_4 son_4
key                                                            
E          A     B        B     C        C     D        D     E
J          F     G        G     H        H     I        I     J

answered Jan 14 '22 at 16:23

BENY

317,841
20
164
234

It works well when the fathers and sons are ordered in the right way, but I did the examples to simple, sorry about that. In fact AB could be followed by DE and then by BC and finally by CD. And so your function will return couples 1 to 4 as AB DE BC CD and not AB BC CD DE. – Lemisourd Jan 14 '22 at 16:42

score 0 · Accepted Answer · answered Mar 07 '22 at 15:49

I found a solution for this specific type of dataframe : where we only have 1 predecessor for all values except root. It also requires using NetworkX. I didn't find a way to do it only using Pandas.

First, we need to build a graph from edgelist :

G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph, edge_key = 'key')
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
#write_dot(G,'test.dot')
plt.title('draw_networkx')
pos =graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)

For pygraphviz install, please see this question. Then end-to-end links dataframe is built with :

num=0
num_max = len(df.key.drop_duplicates())
m_max = 30
dfy = pd.DataFrame(index=range(num_max),columns=range(m_max))
for n in df.key.drop_duplicates() :
    m = 0
    dfy.iloc[num, m] = n
    while len(list(G.predecessors(dfy.iloc[num,m])))!=0 :
        dfy.iloc[num,m+1] = list(G.predecessors(dfy.iloc[num,m]))[0]
        m+=1
    num+=1
print(dfy)

Output :

   0  1  2  3  4    5    6    7    8    9  ...   
0  E  D  C  B  A  NaN  NaN  NaN  NaN  NaN  ...  
1  K  J  I  H  G    F  NaN  NaN  NaN  NaN  ...

Pandas Dataframe - running through two columns 'Father' and 'Son' to rebuild end-to-end links step by step

2 Answers2

Linked