Pandas merge creates unwanted duplicate entries

Question

I'm new to Pandas and I want to merge two datasets that have similar columns. Each column will have some values that the other lacks, in addition to many values they share. Each column also contains some duplicates that I'd like to keep. My desired output is shown below. Passing how='inner' or how='outer' does not yield the desired result.

import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

print(pd.merge(df1,df2))

output:
   A
0  2
1  2
2  2
3  2
4  3
5  4
6  5

desired/expected output:
   A
0  2
1  2
2  3
3  4
4  5

Please let me know how/if I can achieve the desired output using merge, thank you!

EDIT: To clarify why this behavior confuses me: if I simply add another column, the result has only two 2's rather than four, so I would expect my first example to produce two 2's as well. Why does the behavior seem to change? What is pandas doing?

import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)

print(pd.merge(df1,df2))

output:
   A       B
0  2     red
1  2  orange
2  3  yellow
3  4   green
4  5    blue

However, based on the first example I would expect:
   A       B
0  2     red
1  2  orange
2  2     red
3  2  orange
4  3  yellow
5  4   green
6  5    blue

Could you please add a less ambiguous example, say with some different data points? — miradulo, Feb 24 '17 at 16:57
I've run into this exact problem before. It happens when there are duplicates in the column you are merging on. — AsheKetchum, Feb 24 '17 at 17:13
The answer I provided will help you get around it with a temporary index. You'll get the desired output, but it is not necessarily the most efficient method. — AsheKetchum, Feb 24 '17 at 17:22
When you use [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) without specifying the columns to join on, pandas will by default join on all common columns, which is why you're seeing the different behavior in your two examples. — root, Feb 24 '17 at 19:49
I don't think `merge` is actually what you want to use, but the question is still a little unclear. What do you expect if `df1` and `df2` have different values? Or will they always be the same? What columns do you want to perform the "merge" on? — root, Feb 24 '17 at 19:56
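
The default-key behavior root describes can be checked directly. The sketch below reuses the question's two-column data (the names `implicit`, `explicit` and `on_a_only` are illustrative, not from the thread). It suggests that omitting `on` is equivalent to joining on every shared column, and that joining on 'A' alone brings back the cross-matching of the two 2's.

```python
import pandas as pd

df1 = df2 = pd.DataFrame(
    {'A': [2, 2, 3, 4, 5], 'B': ['red', 'orange', 'yellow', 'green', 'blue']}
)

# With no `on`, merge joins on all columns the two frames share: A and B here.
implicit = pd.merge(df1, df2)
explicit = pd.merge(df1, df2, on=['A', 'B'])
print(implicit.equals(explicit))

# Joining on A alone pairs every 2 on the left with every 2 on the right.
on_a_only = pd.merge(df1, df2, on='A')
print(len(implicit), len(on_a_only))
```

Because each (A, B) pair is unique, the first merge has five rows. Joining on A alone gives seven, the same count as the single-column example in the question.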

René · Accepted Answer · 2017-02-25T19:34:18.637

6

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()

df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)

print(df)

Output:
   A
0  2
1  2
2  3
3  4
4  5

edited Feb 25 '17 at 19:34

answered Feb 25 '17 at 19:20

Could you add some comment on what that penultimate line is doing? – Cai Nov 19 '18 at 11:51
What happens if I have two different data set? let say: dict1 = {'A':[2,2,3,4,5]} dict2 = {'B':[2,2,3,4,5]}, how do I apply : df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True) of your code to it ? – Wale Jul 15 '21 at 18:23
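
On both comments above: as far as I can tell, the penultimate line keeps only the merged rows whose two source rows sat at the same position in df1 and df2, then discards the helper columns. The sketch below spells that out step by step, using Wale's differently named columns with merge's `left_on`/`right_on` arguments. The names `pairs`, `same_row` and `result` are illustrative.

```python
import pandas as pd

# Wale's variant: the key column has a different name in each frame.
df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5]}).reset_index()
df2 = pd.DataFrame({'B': [2, 2, 3, 4, 5]}).reset_index()

# Both frames now carry their original row position in a column named 'index';
# merge renames the clashing columns to index_x (left) and index_y (right).
pairs = df1.merge(df2, left_on='A', right_on='B')

# Keep only the pairs whose two rows sat at the same position.
same_row = pairs['index_x'] == pairs['index_y']
result = pairs.loc[same_row, ['A']].reset_index(drop=True)
print(result)
```

This relies on matching rows occupying the same position in both frames. If the two frames are ordered differently, the position filter will drop rows that should have matched.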

score 3 · Answer 2 · answered Oct 10 '21 at 22:34

The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:

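import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})

join_cols = ['A']

# Drop repeated join keys from the right-hand frame for this merge only; df2 itself is unchanged.
right_unique = df2[~df2.duplicated(subset=join_cols, keep='first')]
merged = pd.merge(df1, right_unique, on=join_cols)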

Note that join_cols is defined once, so the columns used for the join and the columns used to remove duplicates are guaranteed to match.
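
One way to notice duplicate join keys before they produce extra rows is merge's `validate` argument, which raises `pandas.errors.MergeError` when the keys are not as unique as declared. A minimal sketch on the question's data, with illustrative variable names:

```python
import pandas as pd
from pandas.errors import MergeError

df1 = df2 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
join_cols = ['A']

# Declaring the join many-to-one makes pandas check that the right-hand keys are unique.
try:
    pd.merge(df1, df2, on=join_cols, validate='many_to_one')
    duplicates_detected = False
except MergeError:
    duplicates_detected = True
print(duplicates_detected)

# After de-duplicating the right side, the same check passes.
right_unique = df2.drop_duplicates(subset=join_cols)
merged = pd.merge(df1, right_unique, on=join_cols, validate='many_to_one')
print(merged['A'].tolist())
```

The check only reports the problem. Whether de-duplicating the right-hand side is the correct fix depends on what the duplicate rows mean in your data.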

AsheKetchum · Answer 3 · 2017-02-24T17:26:29.840

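import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1['index'] = range(len(df1))  # temporary positional key
df2 = pd.DataFrame(dict2)
df2['index'] = range(len(df2))

# Merge on both 'A' and 'index', then discard the temporary key.
# drop() returns a new frame, so the result has to be assigned.
merged = df1.merge(df2).drop(columns='index')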

The idea is to merge on the matching indices as well as the matching 'A' values.
Without the index column, merge pairs every row whose keys match: the first 2 in df1 matches both the first and second 2 in df2, and the second 2 in df1 matches both of them as well, which gives four rows of 2.

If you try this, you will see what I am talking about.

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]

df1.merge(df2, on = 'A')
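
The positional index used above only lines rows up when both frames list their values in the same order. A variation on the same idea, not taken from this answer, is to number each repeat of a value with `groupby(...).cumcount()` and merge on the value plus that counter, so the n-th 2 on the left pairs only with the n-th 2 on the right. A sketch, assuming a hypothetical helper column named `occurrence` and a right-hand frame deliberately shuffled for the demonstration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
df2 = pd.DataFrame({'A': [5, 2, 4, 2, 3]})  # same values, different row order

# Number the repeats of each value: the first 2 gets 0, the second 2 gets 1.
left = df1.assign(occurrence=df1.groupby('A').cumcount())
right = df2.assign(occurrence=df2.groupby('A').cumcount())

# Matching on value and occurrence pairs each duplicate exactly once.
merged = left.merge(right, on=['A', 'occurrence']).drop(columns='occurrence')
print(merged)
```

If one side has more repeats of a value than the other, an inner merge keeps only as many copies as both sides share.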

Qehu · Answer 4 · 2017-02-24T17:36:37.403

0

Did you try df.drop_duplicates()?

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

df = pd.merge(df1, df2)
df_new = df.drop_duplicates()
print(df)
print(df_new)

Seems that it gives the results that you want

edited Feb 24 '17 at 17:36

answered Feb 24 '17 at 17:03

I know you don't have the rep to comment yet, but this is not an answer. – miradulo Feb 24 '17 at 17:09
Have mercy on the guy – AsheKetchum Feb 24 '17 at 17:10
@AsheKetchum "Mercy" upvoting is probably not healthy. This is not an answer. – miradulo Feb 24 '17 at 17:11
That is True :) – AsheKetchum Feb 24 '17 at 17:13
Actually, for this specific problem this is an answer. The fact that I use "?" doesn't make it less of an answer. Anyway, now I can comment :) – Qehu Feb 24 '17 at 17:15
You can only comment on your own answers and questions as of now :P but you'll get there lol. – AsheKetchum Feb 24 '17 at 17:19
@Mitch Either way I think it's not as bad as down voting for no reason :) – AsheKetchum Feb 24 '17 at 17:20
@AsheKetchum There is a reason, though. I don't find this answer useful. – miradulo Feb 24 '17 at 17:21
@Mitch the fact that guides somehow to a direction or something is redundant? – Qehu Feb 24 '17 at 17:23
@Mitch Fair enough, care to elaborate? Otherwise it wouldn't be too unfair to say that the community might not find you useful – AsheKetchum Feb 24 '17 at 17:23
@AsheKetchum I'm thrilled you speak on behalf of the community. – miradulo Feb 24 '17 at 17:29
@Mitch that's irrelevant, the purpose is to provide help and help people improve. You haven't offered any of that. – AsheKetchum Feb 24 '17 at 17:30
@AsheKetchum Because OP's question is still unclear. – miradulo Feb 24 '17 at 17:31
@Mitch The fact that you do not understand it doesn't necessarily make it unclear. I have provided another answer as well. If you bothered to take a look at it and maybe run the code yourself, you will see that it is, in fact, rather clear. – AsheKetchum Feb 24 '17 at 17:32
I hope that after my edit, Mr. @Mitch has a clearer opinion about my answer – Qehu Feb 24 '17 at 17:38
`drop_duplicates` will not produce the desired result. The desired result has two instances of 2 in it, and `drop_duplicates` will result in only one instance of 2. – root Feb 24 '17 at 17:43
@Uheq Yes, it is wrong. And your extended example has demonstrated it is wrong. – miradulo Feb 24 '17 at 17:51
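
root's objection is easy to confirm. In this sketch, built from the question's data with illustrative variable names, dropping duplicates after the merge collapses the four 2's into one rather than the two the question asks for.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})
df2 = pd.DataFrame({'A': [2, 2, 3, 4, 5]})

merged = pd.merge(df1, df2)          # seven rows, four of them equal to 2
deduped = merged.drop_duplicates()   # every distinct row survives only once

print(deduped['A'].tolist())
print((deduped['A'] == 2).sum())
```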

score 0 · Answer 5 · answered Nov 06 '20 at 19:37

I unfortunately ran into a similar problem, though I see this question is now old. I solved it by using drop_duplicates differently: I applied it to the two original tables before merging, even though those tables contained no duplicates. Here is an example (I apologize, I am not a professional programmer):

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()

df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()

df=pd.merge(df1,df2)
print('df1:')
print( df1 )

print('df2:')
print( df2 )

print('df:')
print( df )
