17

I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.

import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

print(pd.merge(df1,df2))

output:
   A
0  2
1  2
2  2
3  2
4  3
5  4
6  5

desired/expected output:
   A
0  2
1  2
2  3
3  4
4  5

Please let me know how/if I can achieve the desired output using merge, thank you!

EDIT To clarify why I'm confused about this behavior, if I simply add another column, it doesn't make four 2's but rather there are only two 2's, so I would expect that in my first example it would also have the two 2's. Why does the behavior seem to change, what's pandas doing?

import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)

print(pd.merge(df1,df2))

output:
   A       B
0  2     red
1  2  orange
2  3  yellow
3  4   green
4  5    blue

However, based on the first example I would expect:
   A       B
0  2     red
1  2  orange
2  2     red
3  2  orange
4  3  yellow
5  4   green
6  5    blue
Nic Scozzaro
  • 6,651
  • 3
  • 42
  • 46
  • 1
    Could you please add a less ambiguous example, say with some different data points? – miradulo Feb 24 '17 at 16:57
  • 11
    I've ran into the exact problem before. This situation will happen when you have duplicates in the column you are trying to merge by – AsheKetchum Feb 24 '17 at 17:13
  • The answer I provided will help you get around it with a temporary index. You'll get the desired output, but it is not necessarily the most efficient method. – AsheKetchum Feb 24 '17 at 17:22
  • @Mitch I edited the post to clarify what I'm confused about – Nic Scozzaro Feb 24 '17 at 18:33
  • When you use [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) without specifying the columns to join on, pandas will by default join on all common columns, which is why you're seeing the different behavior in your two examples. – root Feb 24 '17 at 19:49
  • 1
    I don't think `merge` is actually what you want to use, but the question is still a little unclear. What do you expect if `df1` and `df2` have different values? Or will they always be the same? What columns do you want to perform the "merge" on? – root Feb 24 '17 at 19:56

5 Answers5

6
import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()

df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)

print(df)

Output:

   A
0  2
1  2
2  3
3  4
4  5
René
  • 4,594
  • 5
  • 23
  • 52
  • 17
    Could you add some comment on what that penultimate line is doing? – Cai Nov 19 '18 at 11:51
  • What happens if I have two different data set? let say: dict1 = {'A':[2,2,3,4,5]} dict2 = {'B':[2,2,3,4,5]}, how do I apply : df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True) of your code to it ? – Wale Jul 15 '21 at 18:23
3

The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

join_cols = ['A']

merged = pd.merge(df1, df2[df2.duplicated(subset=join_cols, keep='first') == False],  on=join_cols)

Note we defined join_cols, ensuring columns being joined and columns duplicates are being removed on match.

2
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]

df1.merge(df2).drop('index', 1, inplace = True)

The idea is to merge based on the matching indices as well as matching 'A' column values.
Previously, since the way merge works depends on matches, what happened is that the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was matched to both the first and second 2 in df2 as well.

If you try this, you will see what I am talking about.

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]

df1.merge(df2, on = 'A')
AsheKetchum
  • 1,098
  • 3
  • 14
  • 29
0

did you try df.drop_duplicates() ?

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

df=pd.merge(df1,df2)
df_new=df.drop_duplicates() 
print df
print df_new

Seems that it gives the results that you want

Qehu
  • 135
  • 2
  • 14
0

I have unfortunately stumbled upon a similar problem which I see is now old. I solved it by using this function in a different way, applying it to the two original tables, even though there were no duplicates in these. This is an example (I apologize, I am not a professional programmer):

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()

df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()

df=pd.merge(df1,df2)
print('df1:')
print( df1 )

print('df2:')
print( df2 )

print('df:')
print( df )