Repeatedly merging Pandas dataframe in python loses most data after the 3 merge

Question

I am doing a set of merging on two dataframes. One data frame contains a two columns which shows which a parent child connection. Like this

RID	Child_RID
xxxxx	yyyyy
yyyyy	zzzzz

Where a child will appear in the RID section to as they may also have children

I have a second dataframe that has more information on an object like this

name	RID	Child_1
site	xxxx	yyyy
site	xxxx	oooo
somthing	iiii	llll

I wrote a merging function to reuse as i will keep doing it for a number of times until there are no children left.

And for the first merge it works fine and produces what I want. But on the second and third merge it drops information. I would expect for it to have nan if there is missing data which is what I want. but it erases pretty much parts or all the of previous merges.

def merge_dataset(df1, list_df, index_1, index_2):
    num_df = pd.DataFrame(list_df, columns = [index_1,index_2])
    df2 = pd.merge(df1, num_df, on=index_1)
    return df2

And this is how I use this method

mg1 = merge_dataset(df, parent_child_longlist, 'child_1','child_2')
# Change column order
cols = ['name', 'rid', 'child_1','child_2']
mg1 = mg1[cols]

And this works fine producing this

The second merge seemed to drop information from the dataframe which I dont know why. I would expect for it to have more rows than the first merge I called the second merge like the first one

mg2 = merge_dataset(mg1, parent_child_longlist, 'child_2','child_3')
cols = ['name', 'rid', 'child_1','child_2', 'child_3']
mg2 = mg2[cols]

The issue is magnified with the third merge. I call it the same way

mg3 = merge_dataset(mg2, parent_child_longlist, 'child_3','child_4')
cols = ['name', 'rid', 'child_1','child_2', 'child_3', 'child_4']
mg3 = mg3[cols]

But then drops 90% of the data like this

So I am a bit lost as to why it would drop all the data after this third merge when the first two seems to work just fine. And i will need to do this 10 or 20 more times until all the parent child connections are mapped in this dataframe.

maybe try with the parameter `how='left'` or `'outer'` in the merge because by default, it does a `'inner'` merge :) — Ben.T, Apr 29 '21 at 15:20
dude you are god sent. I tried outer and left before but it failed on me and I got back to basics this round. Setting how to outer fixed the issue and did it perfect. Could you post that as the answer and I will accept — MNM, Apr 29 '21 at 15:27
I will more flag as dup to redirect towards pandas merge 101 and leave the comment to help future readers :) — Ben.T, Apr 29 '21 at 15:57

Repeatedly merging Pandas dataframe in python loses most data after the 3 merge

0 Answers0