0

I have a long dataframe I need to transform to get a wide one. The long one is :

df = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'J', 'J', 'J', 'J'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J']
})
df

First thing to do, I think, is to group it by key. Then we have to find where those keys are found into the column 'son', it's the end (and last son) of the link I need to rebuild.

To rebuild the link, I need to look for his 'father'. His 'father' needs to be kept as father of final step and, also needs to be found into 'son'.

I need to iterate this until a 'father' cannot be found into the 'son' column, so it's going to be the father_0 of the link.

I think it could be done iterating those steps into a recursive function where the stop case : is 'father' not found in 'son'.

Here is the dataframe I want to get from this :

df1 = pd.DataFrame({
    'key' : ['E', 'J'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
})
df1

I simplified the problem here with 2 different links of the same depth, but they could be from depth 1 to depth 10 (maybe more but rarely and unpredictably) for a lot of different keys. Here is another example of df with 2 links of different size :

df_ = pd.DataFrame({
    'key' : ['E', 'E', 'E', 'E', 'K', 'K', 'K', 'K', 'K'],
    'father' : ['A', 'D', 'C', 'B', 'F', 'H', 'G', 'I', 'J'],
    'son' : ['B', 'E', 'D', 'C', 'G', 'I', 'H', 'J', 'K']
})
df_

df_1 = pd.DataFrame({
    'key' : ['E', 'K'],
    'father_1' : ['A', 'F'],
    'son_1' : ['B', 'G'],
    'father_2' : ['B', 'G'],
    'son_2' : ['C', 'H'],
    'father_3' : ['C', 'H'],
    'son_3' : ['D', 'I'],
    'father_4' : ['D', 'I'],
    'son_4' : ['E', 'J'],
    'father_5' : [np.NaN, 'J'],
    'son_5' : [np.NaN, 'K']
})
df_1 

Then the final step is easy, it's about taking 'father_x' and 'son_x-1' into 'step_x-1' : So the resulting dataframes for these examples would be :

df2 = pd.DataFrame({
    'key' : ['E', 'J'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
})
df2

df_2 = pd.DataFrame({
    'key' : ['E', 'K'],
    'step_0' : ['A', 'F'],
    'step_1' : ['B', 'G'],
    'step_2' : ['C', 'H'],
    'step_3' : ['D', 'I'],
    'step_4' : ['E', 'J'],
    'step_5' : [np.NaN, 'K']
})
df_2

My concerne is more about the way to aggregate the data from wide to long following the previously given rules into an recursive function.

It's like in a groupby.agg but that I can't just pass a dictionnary into it because the new columns are based on the number of iteration of the recursive function on each key.

Lemisourd
  • 135
  • 1
  • 11
  • 1
    networkx would be helpful. – Quang Hoang Jan 14 '22 at 16:19
  • Yes, I already thought about it, but I always have a problem drawing graphs. I always have the error `networkx random_state index is incorrect` even with last version of pandas and networkx. Maybe I will also make a question about it, but I would like to use this library to do this. – Lemisourd Jan 14 '22 at 16:22
  • Maybe worth another question on that error? – Quang Hoang Jan 14 '22 at 16:23
  • I think I will, it may be useful. – Lemisourd Jan 14 '22 at 16:29
  • The edit I made is more likely to need networkx, because the examples are not ordered smoothly as I presented them first but ordered randomly. – Lemisourd Jan 14 '22 at 16:46
  • I finally tested networkx but I don't understand how to rebuild the links and then turn it back to a wide dataframe (df -> networkx.MultiDiGraph -> df gave me output = input). I'm going to make it only with pandas and post the answer while getting it. However your suggestion was great. – Lemisourd Jan 19 '22 at 14:39

2 Answers2

1

Assign the new key with cumcount then we can do pivot

out = df.assign(c = df.groupby('key').cumcount().add(1).astype(str)).pivot('key','c').sort_index(level=1,axis=1)
out.columns = out.columns.map('_'.join)
out
Out[34]: 
    father_1 son_1 father_2 son_2 father_3 son_3 father_4 son_4
key                                                            
E          A     B        B     C        C     D        D     E
J          F     G        G     H        H     I        I     J
BENY
  • 317,841
  • 20
  • 164
  • 234
  • It works well when the fathers and sons are ordered in the right way, but I did the examples to simple, sorry about that. In fact AB could be followed by DE and then by BC and finally by CD. And so your function will return couples 1 to 4 as AB DE BC CD and not AB BC CD DE. – Lemisourd Jan 14 '22 at 16:42
0

I found a solution for this specific type of dataframe : where we only have 1 predecessor for all values except root. It also requires using NetworkX. I didn't find a way to do it only using Pandas.

First, we need to build a graph from edgelist :

G = nx.from_pandas_edgelist(df, 'father', 'son', create_using=nx.MultiDiGraph, edge_key = 'key')
from networkx.drawing.nx_agraph import write_dot, graphviz_layout
#write_dot(G,'test.dot')
plt.title('draw_networkx')
pos =graphviz_layout(G, prog='dot')
nx.draw(G, pos, with_labels=True, arrows=True)

For pygraphviz install, please see this question. Then end-to-end links dataframe is built with :

num=0
num_max = len(df.key.drop_duplicates())
m_max = 30
dfy = pd.DataFrame(index=range(num_max),columns=range(m_max))
for n in df.key.drop_duplicates() :
    m = 0
    dfy.iloc[num, m] = n
    while len(list(G.predecessors(dfy.iloc[num,m])))!=0 :
        dfy.iloc[num,m+1] = list(G.predecessors(dfy.iloc[num,m]))[0]
        m+=1
    num+=1
print(dfy)

Output :

   0  1  2  3  4    5    6    7    8    9  ...   
0  E  D  C  B  A  NaN  NaN  NaN  NaN  NaN  ...  
1  K  J  I  H  G    F  NaN  NaN  NaN  NaN  ...
Lemisourd
  • 135
  • 1
  • 11