Merging multiple data frames that share one column

Question

I am trying to merge 7 different data frames on a shared column (accident_no), but some data frames contain more rows and duplicate accident_no values. For example:

Table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 (all unique), but table 4 (Person) contains 400 accident_no values with some duplicates, because multiple passengers may have been involved in the same crash. Those rows share the same accident_no, and the information is useful for analysis.

The problem is that I have tried concat, join, and merge, but the result exceeds the size of the largest table: I get more than 400 rows.

So far I tried below methods:

from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, df5, df6, df7]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ACCIDENT_NO', how='left'), dfs)

AND

dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]

print(pd.concat(dfs, axis=1).reset_index())

So, is it expected that I get more than 400 rows, or am I doing something wrong?

Thanks

Do you capture unique persons in all data frames? Or only for table 4? — Parfait, Oct 09 '21 at 22:08

score 0 · Answer 1 · answered Oct 09 '21 at 21:47

You can try:

table1 = table1.merge(table2, on=['accident_no'], how='left')

and repeat for the other tables.
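One thing to watch for when chaining merges like this: if the key repeats on both sides, each merge pairs every matching left row with every matching right row, so the row count can grow past the largest input. A minimal sketch with made-up toy frames (the names accident, person, vehicle and their columns are hypothetical, not the asker's data):

```python
import pandas as pd

# Hypothetical toy data: one accident, two persons, two vehicles
accident = pd.DataFrame({"ACCIDENT_NO": ["A1"], "SEVERITY": [2]})
person = pd.DataFrame({"ACCIDENT_NO": ["A1", "A1"], "AGE": [30, 41]})
vehicle = pd.DataFrame({"ACCIDENT_NO": ["A1", "A1"], "TYPE": ["car", "van"]})

# One-to-many: the single accident row is repeated for each person
step1 = accident.merge(person, on="ACCIDENT_NO", how="left")

# Many-to-many: each of the 2 person rows pairs with each of the 2 vehicle rows
step2 = step1.merge(vehicle, on="ACCIDENT_NO", how="left")

print(len(step1), len(step2))  # 2 4
```

The final frame has 4 rows even though no input has more than 2, which may be what happens when several of the 7 tables repeat ACCIDENT_NO.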

Doğu Can Elçi


This merges the two tables, but when I merge the result with the other 5 data frames, the row count exceeds the maximum number of rows in any table. – Chaudhary Zamurad Oct 09 '21 at 22:12

score 0 · Accepted Answer · answered Oct 09 '21 at 22:23

Consider creating a person count column with groupby().cumcount() in each data frame, then concatenate on person and accident identifiers:

dfs = [
    (df.assign(
        PERSON_NO = lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
       ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    )
    for df in [df1, df2, df3, df4, df5, df6, df7]
]

final_df = pd.concat(dfs, axis=1).reset_index()
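To illustrate, here is the same approach on small made-up frames (accident, person, and vehicle are hypothetical stand-ins for df1 through df7). The row count stays at the size of the largest table, and positions that a smaller table cannot fill come back as NaN:

```python
import pandas as pd

# Hypothetical toy data standing in for the real tables
accident = pd.DataFrame({"ACCIDENT_NO": ["A1", "A2"], "SEVERITY": [2, 3]})
person = pd.DataFrame({"ACCIDENT_NO": ["A1", "A1", "A2"], "AGE": [30, 41, 25]})
vehicle = pd.DataFrame({"ACCIDENT_NO": ["A1", "A1"], "TYPE": ["car", "van"]})

# Number the rows within each accident (1, 2, ...) and index on both keys
dfs = [
    df.assign(
        PERSON_NO=lambda x: x.groupby("ACCIDENT_NO").cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [accident, person, vehicle]
]

# Align horizontally on (PERSON_NO, ACCIDENT_NO); unmatched cells become NaN
final_df = pd.concat(dfs, axis=1).reset_index()

print(len(final_df))  # 3, the row count of the largest table (person)
print(final_df)
```

Note that PERSON_NO here is only a running counter per accident, so the second row of one table lines up with the second row of another by position, not because they describe the same person.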

Parfait


Thanks for the response, mate. The statement above concatenates all the dfs and fills the missing rows with NaN, which leaves me with many columns containing NaN. I cannot replace them with the mean or the most frequent value because more than 50% of the values are NaN. – Chaudhary Zamurad Oct 10 '21 at 00:17
Are you asking a different question? Does this solution resolve the question above of horizontally merging data frames up to the max rows without duplicating ACCIDENT_NO? Regarding imputation of NaNs in columns, consider asking a new question to avoid diverging from this post and confusing future readers. Be sure to include a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), as it is unclear what any data frame contains or what your desired result is. – Parfait Oct 10 '21 at 02:25