1

I am running into the following issue. I want to merge two dataframes on multiple columns(11 to be exact). Surprisingly the usual methods do not work. Example dataframes are as follows: df1:

c1  c2  c3  c4  c5  c6  event_count
1   2   a   ff  0   1   5
1   2   b   fg  1   3   6
1   2   c   hg  2   4   20
1   2   d   gf  0   1   7
1   2   e   fg  1   4   1

df2:

c1  c2  c3  c4  c5  c6  event_type  event_price
1   2   a   ff  0   1   a           20
1   2   b   fg  1   3   a           20
1   2   c   hg  2   4   b           30
1   2   d   gf  0   1   b           40
1   2   e   fg  1   4   b           50

I tried

result = pd.merge(df1, df2, how='outer', on = ['c1','c2','c3','c4','c5','c6'])

and also this should technically work:

result = pd.merge(df1, df2, how='outer')

But I get this:

c1  c2  c3  c4  c5  c6  event_count event_type  event_price
1   2   a   ff  0   1   5       
1   2   b   fg  1   3   6       
1   2   c   hg  2   4   20      
1   2   d   gf  0   1   7       
1   2   e   fg  1   4   1       
1   2   a   ff  0   1               a           20
1   2   b   fg  1   3               a           20
1   2   c   hg  2   4               b           30
1   2   d   gf  0   1               b           40
1   2   e   fg  1   4               b           50

When I want to get this:

c1  c2  c3  c4  c5  c6  event_count event_type  event_price
1   2   a   ff  0   1   5            a          20
1   2   b   fg  1   3   6            a          20
1   2   c   hg  2   4   20           b          30
1   2   d   gf  0   1   7            b          40
1   2   e   fg  1   4   1            b          50

I also tried with indexes:

df1_index = df1.set_index(['c1', 'c2','c3'....]); df2_index =df2.set_index(['c1', 'c2','c3'....]); result = pd.concat([df1_index, df2_index], axis=1);

But that gave the same wrong result. I made sure the dtypes of the two tables are the same as well. I am not sure what else to try. Any advice? Thanks :)

Maria Petrova
  • 549
  • 2
  • 6
  • 10

2 Answers2

3

The behavior you describe can happen if the values in one DataFrame are numeric strings, while in the other are numerals. They look the same but compare differently.

For example, consider df1 and df2 below. Their c columns are identical except that df1['c1'] contains strings, while df2['c1'] contains ints.

import pandas as pd
df1 = pd.DataFrame({'c1': ['1', '1', '1', '1', '1'],
 'c2': [2, 2, 2, 2, 2],
 'c3': ['a', 'b', 'c', 'd', 'e'],
 'c4': ['ff', 'fg', 'hg', 'gf', 'fg'],
 'c5': [0, 1, 2, 0, 1],
 'c6': [1, 3, 4, 1, 4],
 'event_count': [5, 6, 20, 7, 1]})

df2 = pd.DataFrame({'c1': [1, 1, 1, 1, 1],
 'c2': [2, 2, 2, 2, 2],
 'c3': ['a', 'b', 'c', 'd', 'e'],
 'c4': ['ff', 'fg', 'hg', 'gf', 'fg'],
 'c5': [0, 1, 2, 0, 1],
 'c6': [1, 3, 4, 1, 4],
 'event_price': [20, 20, 30, 40, 50],
 'event_type': ['a', 'a', 'b', 'b', 'b']})

Merging fails to match any rows because 1 != '1':

print(pd.merge(df1, df2, how='outer'))
#   c1  c2 c3  c4  c5  c6  event_count  event_price event_type
# 0  1   2  a  ff   0   1            5          NaN        NaN
# 1  1   2  b  fg   1   3            6          NaN        NaN
# 2  1   2  c  hg   2   4           20          NaN        NaN
# 3  1   2  d  gf   0   1            7          NaN        NaN
# 4  1   2  e  fg   1   4            1          NaN        NaN
# 5  1   2  a  ff   0   1          NaN           20          a
# 6  1   2  b  fg   1   3          NaN           20          a
# 7  1   2  c  hg   2   4          NaN           30          b
# 8  1   2  d  gf   0   1          NaN           40          b
# 9  1   2  e  fg   1   4          NaN           50          b

If we convert the numeric strings to numerals, then merging succeeds as expected:

df1 = df1.convert_objects(numeric=True)
df2 = df2.convert_objects(numeric=True) # not necessary in my example, but may be necessary for you if the numeric strings are in df2.
print(pd.merge(df1, df2, how='outer'))

yields

   c1  c2 c3  c4  c5  c6  event_count  event_price event_type
0   1   2  a  ff   0   1            5           20          a
1   1   2  b  fg   1   3            6           20          a
2   1   2  c  hg   2   4           20           30          b
3   1   2  d  gf   0   1            7           40          b
4   1   2  e  fg   1   4            1           50          b

While df1 = df1.convert_objects(numeric=True) may fix your problem, it would be better to go back and fix the problem at its source, when df1 and/or df2 are first defined.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thank you! That makes sense. I also checked and changed all data types in both data frames to the same to make sure this was not the case but I used astype() not convert_objects. I will try with convert_object to see if that works better for me. But you are also right that is it best to fix the problem at the source. This data come from dbb queries and if I just pass the dataframes without doing any modifications they are merged correctly. – Maria Petrova Jul 31 '15 at 15:35
1

Why are you doing an outer join? As far as I got you need an inner join. This df1.merge(df2) gives me:

    c1  c2  c3  c4  c5  c6  event_count event_type  event_price
0   1   2   a   ff  0   1   5              a            20
1   1   2   b   fg  1   3   6              a            20
2   1   2   c   hg  2   4   20             b            30
3   1   2   d   gf  0   1   7              b            40
4   1   2   e   fg  1   4   1              b            50
foebu
  • 1,365
  • 2
  • 18
  • 35
  • Might not be why OP was doing it, but I do 'outer' joins the first several times until I'm sure I'm not doing the join wrong. – rajan Sep 27 '22 at 18:47