I have two DataFrames, df1 and df2, and both contain duplicate rows. I want to merge them. So far I have tried removing the duplicates from df2 only, because I need every row from df1.

This may be a duplicate question, but I did not find any solution or hints for this particular scenario.

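import pandas as pd

# Sample frames: 'Name' is duplicated in both df1 and df2
data = {'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
        'Age': [1, 2, 3, 4, 2, 1, 2, 4]}
data2 = {'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
         'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Deduplicate only df2 on 'Name'; df1 keeps its duplicates
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name')
print(dfn)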

Result of above snippet:

  Name  Age Sex
0  ABC    1   M
1  ABC    3   M
2  ABC    4   M
3  MNO    4   M
4  XYZ    2   M
5  XYZ    1   M
6  PQR    2   F

This works perfectly well for the above data, but on my large dataset the same method behaves differently: I get many more rows than expected in dfn.

I suspect the extra rows come from the size of the data and the larger number of duplicates, but I cannot afford to delete the duplicate rows from df1.

Apologies, I am not able to share the actual data as it is too large.

Edit: A sample result from the actual data, showing df2 after removing duplicates and the resulting dfn. df1 has only one entry each for ABC and XYZ:

[Image: df2 after removing duplicates] [Image: resulting dfn]

Thanks in advance!

Mr.B
  • have you tried the `df.join` method? – Yiannis Oct 25 '21 at 20:14
  • No, I will have a look. Thanks! – Mr.B Oct 25 '21 at 20:19
  • We can only guess if your example doesn't show your problem. `pandas` doesn't change behaviour for larger datasets. – Michael Szczesny Oct 25 '21 at 20:20
  • This method should not return more rows than df1, since df2 no longer has duplicates and the merge is inner by default. Is that the case? What do you mean by more rows than expected? – Ben.T Oct 25 '21 at 20:22
  • @MichaelSzczesny I have added a part of the actual data. Could you please check? – Mr.B Oct 25 '21 at 20:34
  • Forgive me @MichaelSzczesny, I didn't get you. Do you mean the new example should produce the correct result even if df1 has duplicates? It is producing duplicate rows even though df1 and df2 each have only one entry for the 'NAME' values ABC and XYZ. – Mr.B Oct 25 '21 at 20:46
  • Please include a [MRE] that actually shows your problem. – Michael Szczesny Oct 25 '21 at 20:51
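
Ben.T's point can be checked directly on the sample data from the question. The following is only a minimal sketch: the names `df2_unique`, `inner`, and `inflated` are my own, and it says nothing about what is happening in the real dataset, which is not shown. It confirms that the deduplicated keys are unique, asks `merge` to enforce that with `validate='many_to_one'`, and contrasts the row count with a merge against the raw df2.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

df2_unique = df2.drop_duplicates('Name')
keys_are_unique = df2_unique['Name'].is_unique

# validate='many_to_one' makes pandas raise a MergeError if 'Name'
# is not unique in the right-hand frame.
inner = df1.merge(df2_unique, on='Name', validate='many_to_one')

# For contrast: merging against the raw df2 multiplies matching rows.
inflated = df1.merge(df2, on='Name')

print(keys_are_unique)                      # True
print(len(df1), len(inner), len(inflated))  # 8 7 14
```

If the same `validate` argument raises on the real data, the right-hand keys are not actually unique there, which would be one possible source of extra rows.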

1 Answer

Try to drop_duplicates from df1 too:

dfn = pd.merge(df1, df2.drop_duplicates('Name'),
               on='Name', how='left')
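
As a rough illustration of what the left merge does on the question's sample data (a sketch only, with `left` and `def_sex_missing` as my own variable names): every df1 row is kept in its original order, and a name that is missing from df2, here DEF, gets NaN in the Sex column instead of being dropped.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# Left merge against the deduplicated df2: one output row per df1 row.
left = pd.merge(df1, df2.drop_duplicates('Name'), on='Name', how='left')

# DEF has no partner in df2, so its Sex value is missing rather than the row vanishing.
def_sex_missing = left.loc[left['Name'] == 'DEF', 'Sex'].isna().all()

print(len(left) == len(df1))  # True
print(def_sex_missing)        # True
```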
Corralien
  • I cannot do that, as I need all the data from df1. – Mr.B Oct 25 '21 at 20:21
  • So use `how='left'` as an argument of `merge`. – Corralien Oct 25 '21 at 20:21
  • I will try this as well, thank you! – Mr.B Oct 25 '21 at 20:26
  • `how='left'` worked perfectly for me, thanks so much for that. I did try to find out what exactly it is doing, but I would still like to understand it in simpler words from you. – Mr.B Oct 27 '21 at 17:38
  • `how='left'` means all rows from df1 will be kept. Read this almost perfect Q/A: [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101). The animations are great for understanding how each method works. – Corralien Oct 27 '21 at 18:05