Only keep direct parent child id pairs in dataframe

Question

I have the following dataframe:

   id_parent  id_child
0       1100      1090
1       1100      1080
2       1100      1070
3       1100      1060
4       1090      1080
5       1090      1070
6       1080      1070

and I only want to keep the direct parent-child connections. Example: 1100 is connected to 1090, 1080 and 1070, but of these only 1090 shall be kept, because 1080 and 1070 are already children of 1090. This example df contains only one sample; the real df consists of multiple parent/child clusters.

Therefore the output should look like this:

   id_parent  id_child
0       1100      1090
1       1090      1080
2       1080      1070
3       1100      1060

sample code:

import pandas as pd

#create sample input 
df_input = pd.DataFrame.from_dict({'id_parent': {0: 1100, 1: 1100, 2: 1100, 3: 1100, 4: 1090, 5: 1090, 6: 1080}, 'id_child': {0: 1090, 1: 1080, 2: 1070, 3: 1060, 4: 1080, 5: 1070, 6: 1070}})

#create sample output
df_output = pd.DataFrame.from_dict({'id_parent': {0: 1100, 1: 1090, 2: 1080, 3: 1100}, 'id_child': {0: 1090, 1: 1080, 2: 1070, 3: 1060}})

My current approach would be based on this question: "Creating dictionary of parent child pairs in pandas dataframe". But maybe there is a simple, clean way to solve this without relying on additional non-standard libraries?

score 1 · Accepted Answer · answered Jul 20 '20 at 00:01

This worked for me:

# First: group df by child id
grouped = df_input.groupby(['id_child'], as_index=True).apply(lambda a: a[:])
# Second: create a new output dataframe
OUTPUT = pd.DataFrame(columns=['id_parent','id_child'])
# Third: fill it with the unique child ids and, where a child has more than one parent, the minimum parent id.
for i,id_ch in enumerate(df_input.id_child.unique()):
    OUTPUT.loc[i] = [min(grouped.loc[id_ch].id_parent), id_ch]
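The same idea (for each child, keep the smallest parent id) can probably be written without the explicit loop. The following is a minimal sketch, not the answerer's code; it assumes, as in the sample data, that the closest ancestor of a child is always the one with the smallest id. The name `result` is only illustrative.

```python
import pandas as pd

# Sample input from the question.
df_input = pd.DataFrame({
    'id_parent': [1100, 1100, 1100, 1100, 1090, 1090, 1080],
    'id_child':  [1090, 1080, 1070, 1060, 1080, 1070, 1070],
})

# For each child keep the smallest parent id. sort=False keeps the children
# in order of first appearance, like the loop over unique() above.
# ASSUMPTION: the smallest parent id is always the direct parent.
result = (
    df_input.groupby('id_child', sort=False, as_index=False)['id_parent']
    .min()[['id_parent', 'id_child']]
)
print(result)
```

On the sample input this gives the four rows of the expected output, in the same order.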

answered Jul 20 '20 at 00:01

Daniela Varela

Thank you for the answer, I am currently looking into it, and it looks promising! Is there a reason for the groupby at the start? Is it only to get the child ids as the dataframe index, or is there another reason I am unaware of? – Andreas Jul 20 '20 at 00:23
This works nicely as long as the child only has 1 parent. Since I constructed the question this way and didn't specify that a child can also have multiple parents, I will accept this answer. Thank you! – Andreas Jul 20 '20 at 01:09
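
For the case raised in the last comment, where a child can have several direct parents, one possible approach is to drop a row only when its child can also be reached through another child of the same parent. This is a rough sketch using just pandas and the standard library, not something taken from the answers; it assumes the relations contain no cycles, and the function name `keep_direct_edges` and the extra parent id 1095 are made up for illustration. It searches the graph once per row, so it is meant for small to medium frames.

```python
import pandas as pd

def keep_direct_edges(df):
    """Drop every parent/child row that is also implied by a longer path."""
    # Map each parent to the set of children listed for it.
    children = {}
    for parent, child in zip(df['id_parent'], df['id_child']):
        children.setdefault(parent, set()).add(child)

    def reachable(start, target):
        # Iterative depth-first search from start, looking for target.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
        return False

    keep = [
        # A row is indirect if its child is reachable through a sibling.
        not any(reachable(s, child) for s in children[parent] if s != child)
        for parent, child in zip(df['id_parent'], df['id_child'])
    ]
    return df[keep].reset_index(drop=True)

# Sample input from the question.
df_input = pd.DataFrame({
    'id_parent': [1100, 1100, 1100, 1100, 1090, 1090, 1080],
    'id_child':  [1090, 1080, 1070, 1060, 1080, 1070, 1070],
})
direct = keep_direct_edges(df_input)
print(direct)

# Hypothetical extra row: 1080 gets a second direct parent, 1095.
extra = pd.DataFrame({'id_parent': [1095], 'id_child': [1080]})
df_multi = pd.concat([df_input, extra], ignore_index=True)
direct_multi = keep_direct_edges(df_multi)
print(direct_multi)
```

The surviving rows keep their original order, so on the sample input the row (1100, 1060) comes second rather than last as in the expected output. In the second frame both (1090, 1080) and (1095, 1080) are kept.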

score 1 · Answer 2 · answered Jul 20 '20 at 00:11

I could get the result using drop_duplicates:

In [6]: df
Out[6]:
   id_parent  id_child
0       1100      1090
1       1100      1080
2       1100      1070
3       1090      1080
4       1090      1070
5       1080      1070

In [9]: df.drop_duplicates(subset=['id_parent']).reset_index(drop=True)
Out[9]:
   id_parent  id_child
0       1100      1090
1       1090      1080
2       1080      1070

answered Jul 20 '20 at 00:11

bigbounty

You are of course right: to recreate the output this would be sufficient, but I might have used too shallow an example. I upvoted the answer, though, because for the initial question this would be the solution. I will update my question now. – Andreas Jul 20 '20 at 00:18
@Andreas `drop_duplicates` by default only keeps first occurrence. Refer - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html – bigbounty Jul 20 '20 at 00:20
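
To illustrate the point about `keep`: the small sketch below runs `drop_duplicates` on the updated sample input from the question. It suggests that the result depends on the row order within each parent, and that only one row per parent survives, so the pair (1100, 1060) is lost. The variable names `first` and `last` are only illustrative.

```python
import pandas as pd

# Updated sample input from the question (includes the 1100 -> 1060 pair).
df_input = pd.DataFrame({
    'id_parent': [1100, 1100, 1100, 1100, 1090, 1090, 1080],
    'id_child':  [1090, 1080, 1070, 1060, 1080, 1070, 1070],
})

# keep='first' is the default: the first row of each parent survives.
first = df_input.drop_duplicates(subset=['id_parent']).reset_index(drop=True)

# keep='last' keeps the last row of each parent instead.
last = df_input.drop_duplicates(subset=['id_parent'], keep='last').reset_index(drop=True)

print(first)
print(last)
```

With the default, the direct child is kept only because it happens to be listed first for each parent.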
