1

I have a pandas dataframe which contains a list of big family members.

import pandas as pd

data = {'child':['Joe','Anna','Anna','Steffani','Bob','Rea','Dani','Dani','Selma','John','Kevin'],
             'parents':['Steffani','Bob','Steffani','Dani','Selma','Anna','Selma','John','Kevin','-','Robert'],
            }
df = pd.DataFrame(data)

From this dataframe, I need to build a new table by adding multiple columns to the right which show the relationships between data. The value in the right column shows the elder relation. Each column represents the relationship. If I can draw the diagram, it may look like this:

child --> parents --> grandparents --> parents of grandparents --> grandparents of grandparents --> etc.

So, the expected output of the dataframe will be like this:

    child       parents     A           B           C           D (etc)
---------------------------------------------------------------------------------
0   Joe         Steffani    Dani        Selma       Kevin       <If still possible>
1   Joe         Steffani    Dani        John        -
2   Anna        Bob         Selma       Kevin       Robert
3   Anna        Steffani    Dani        Selma       Kevin
4   Anna        Steffani    Dani        John        -
5   Steffani    Dani        Selma       Kevin       Robert
6   Steffani    Dani        John        -           -
7   Bob         Selma       Kevin       Robert      -
8   Rea         Anna        Bob         Selma       Kevin
9   Rea         Anna        Steffani    Dani        Selma
10  Rea         Anna        Steffani    Dani        John
11  Dani        Selma       Kevin       Robert      -
12  Dani        John        -           -           -
13  Selma       Kevin       Robert      -           -
14  John        -           -           -           -
15  Kevin       Robert      -           -           -

Currently, I build the new table manually using pandas.merge. But I need to do it many times until the last column which has no elder relationship with the left column. For example:

Step 1

df2 = pd.merge(df, df, left_on='parents', right_on='child', how='left').fillna('-')
df2 = df2[['child_x','parents_x','parents_y']]
df2.columns = ['child','parents','A']

Step 2

df3 = pd.merge(df2, df, left_on='A', right_on='child', how='left').fillna('-')
df3 = df3[['child_x','parents_x','A','parents_y']]
df3.columns = ['child','parents','A','B']

Step 3

df4 = pd.merge(df3, df, left_on='B', right_on='child', how='left').fillna('-')
df4 = df4[['child_x','parents_x','A','B','parents_y']]
df4.columns = [['child','parents','A','B','C']]

Step 4

Write the similar code to add the 6th column for column D if the value in Column C still has elder relation.

Problem:

Since I have big data in the dataframe (more than 10K of datapoints), how to solve it without writing code step by step? I don't know how many steps I need to build the final table.

Thanks in advance for any help.

YusufUMS
  • 1,506
  • 1
  • 12
  • 24

2 Answers2

1

Consider the chain merge with reduce using suffixes argument of merge with some handling of duplicate column names and removal of intermediate child columns:

def proc_build(x,y):
    temp = (pd.merge(x, y, left_on='parents', right_on='child', 
                     how='left', suffixes=['_',''])                     
              .fillna('-'))

    return temp       

final_df = (reduce(proc_build, [df, df, df, df])
               .set_axis(['child', 'parents',
                          'child1', 'A', 
                          'child2', 'B',
                          'child3', 'C'], axis='columns', inplace=False)
               .reindex(['child', 'parents'] + list('ABC'), axis='columns')
           )

print(final_df)

#        child   parents         A       B       C
# 0        Joe  Steffani      Dani   Selma   Kevin
# 1        Joe  Steffani      Dani    John       -
# 2       Anna       Bob     Selma   Kevin  Robert
# 3       Anna  Steffani      Dani   Selma   Kevin
# 4       Anna  Steffani      Dani    John       -
# 5   Steffani      Dani     Selma   Kevin  Robert
# 6   Steffani      Dani      John       -       -
# 7        Bob     Selma     Kevin  Robert       -
# 8        Rea      Anna       Bob   Selma   Kevin
# 9        Rea      Anna  Steffani    Dani   Selma
# 10       Rea      Anna  Steffani    Dani    John
# 11      Dani     Selma     Kevin  Robert       -
# 12      Dani      John         -       -       -
# 13     Selma     Kevin    Robert       -       -
# 14      John         -         -       -       -
# 15     Kevin    Robert         -       -       -

To extend for another column such as D, add another df to iterable argument of reduce with additional list items in set_axis and reindex, specifically ['child4', 'D'] and list('ABCD'). While there are ways to make these items dynamic, reduce can get expensive and so should be handled with some declarative emphasis.

final_df = (reduce(proc_build, [df] * 5)
               .set_axis(['child', 'parents',
                          'child1', 'A', 
                          'child2', 'B',
                          'child3', 'C',
                          'child4', 'D'], axis='columns', inplace=False)
               .reindex(['child', 'parents'] + list('ABCD'), axis='columns')
           )

print(final_df)

#        child   parents         A       B       C       D
# 0        Joe  Steffani      Dani   Selma   Kevin  Robert
# 1        Joe  Steffani      Dani    John       -       -
# 2       Anna       Bob     Selma   Kevin  Robert       -
# 3       Anna  Steffani      Dani   Selma   Kevin  Robert
# 4       Anna  Steffani      Dani    John       -       -
# 5   Steffani      Dani     Selma   Kevin  Robert       -
# 6   Steffani      Dani      John       -       -       -
# 7        Bob     Selma     Kevin  Robert       -       -
# 8        Rea      Anna       Bob   Selma   Kevin  Robert
# 9        Rea      Anna  Steffani    Dani   Selma   Kevin
# 10       Rea      Anna  Steffani    Dani    John       -
# 11      Dani     Selma     Kevin  Robert       -       -
# 12      Dani      John         -       -       -       -
# 13     Selma     Kevin    Robert       -       -       -
# 14      John         -         -       -       -       -
# 15     Kevin    Robert         -       -       -       -
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Then what if I don't know whether columns E, F etc will be needed in this process? But at least your code is helpful. Thanks – YusufUMS Feb 14 '20 at 01:33
0

This is a rough solution from me. You should optimize it.

  • Loading all datafrme
  • Save all the dataframe's name in list
list_data = [data1,data2]
list_df = []
i = 0
for data in list_data:
    vars()[f'df{i}'] = pd.DataFrame(data)
    list_df.append(f'df{i}')
    i += 1
  • Then create 2 proxy variables;
    • df_family : This will be an output
    • last_df : For breaking the loop, in case, every rows in parent column is '-' but there are dataframe left in the list.
last_df = False
df_family = pd.DataFrame()

  • This parts will merge the dataframe together as you wish. I also change the name to 1,2,...,n so you can rename it easily.
for df in list_df:
    if last_df:
        break

    if (eval(df)['parents'] == '-').all():
        last_df = True

    if df_family.empty:
        df_family = eval(df)
    else:
        df_family = pd.merge(df_family,eval(df), how = 'left', left_on = df_family.columns[-1], right_on = eval(df).columns[0])
        df_family.drop(columns = [eval(df).columns[0]], axis = 1, inplace = True)

    list_cols = [i for i in range(df_family.shape[1])]
    df_family.columns = list_cols
LaChatonnn
  • 159
  • 7
  • 1
    In pandas, if you ever have to call `pd.DataFrame()`, you are not optimally processing objects. Growing a series/data frame with `append` or `merge` inside a loop is a [very expensive operation](https://stackoverflow.com/a/36489724/1422451). Instead build a list/dict of pandas objects to be concatenated once outside a loop. – Parfait Feb 12 '20 at 16:27
  • @Parfait Thank you for your advice. – LaChatonnn Feb 12 '20 at 16:59