13

I have two dataframes that I would like to concatenate column-wise (axis=1) with an inner join. One of the dataframes has some duplicate indices, but the rows are not duplicates, and I don't want to lose the data from those :

df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns = ['a','b']).set_index('a')

df2 = pd.DataFrame([{'a':1,'c':5},{'a':2,'c':6}],columns = ['a','c']).set_index('a')

>>> df1
   b
a   
1  2
1  3
2  4
8  9

>>> df2
   c
a   
1  5
2  6

The default concat behavior is to fill missing values with NaNs:

>>> pd.concat([df1,df2])
    b   c
a
1   2 NaN
1   3 NaN
2   4 NaN
1 NaN   5
2 NaN   6

I want to keep the duplicate indices from df1 and fill them with duplicated values from df2, but in pandas 0.13.1 an inner join on the columns produces an error. In more recent versions of pandas concat does what I want:

>>> pd.concat([df1, df2], axis=1, join='inner')
   b  c
a      
1  2  5
1  3  5
2  4  6

What's the best way to achieve the result I want? Is there a groupby solution? Or maybe I shouldn't be using concat at all?

hobs
  • 18,473
  • 10
  • 83
  • 106
andbeonetraveler
  • 693
  • 3
  • 11
  • 25

2 Answers2

8

You can perform a merge and set the params to use the index from the lhs and rhs:

In [4]:    
df1.merge(df2, left_index=True, right_index=True)
Out[4]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]

Concat should've worked, it worked for me:

In [5]:

pd.concat([df1,df2], join='inner', axis=1)
Out[5]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Oh really? I'm running 0.13.1. Nice that they've apparently fixed this in 0.14.0. I'll test your solutions before and after an upgrade and then accept when I'm done. Thanks! – andbeonetraveler Jul 10 '14 at 19:39
  • Ok, merge works in both 0.13.1 and 0.14.0, so thanks for that solution. Concat gives me the 'cannot reindex from a duplicate index' error in 0.13.1, and in 0.14.0 I get: 'ValueError: Shape of passed values is (2, 5), indices imply (2, 3)'. I wonder why...? I'm running python 3.3.5. – andbeonetraveler Jul 10 '14 at 21:25
  • @andbeonetraveler not sure I'm running numpy 1.8.1 and python 64-bit 3.3 – EdChum Jul 10 '14 at 21:26
  • I have version 0.16.2 and with multi-index none of the solutions does not work for me; merge gives a dataframe as output but with many columns of the left array as NaN - and they should not be. – Dr Fabio Gori Sep 25 '15 at 18:38
  • @IgorFobia if you have a question then post a question, your situation is different to here, SO is not a forum, it's impossible for me to help you based on it doesn't work for me – EdChum Sep 25 '15 at 19:41
0

Coming back to this because I was looking how to merge on columns with different names and keep duplicates:

df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns = ['a','b'])
df1
   a  b
0  1  2
1  1  3
2  2  4

df2 = pd.DataFrame([{'c':1,'d':5},{'c':2,'d':6}],
                   columns = ['c','d'])
df2
   c  d
0  1  5
1  2  6

And found that pd.merge(df1, df2.set_index('c'), left_on='a', right_index=True) accomplished this:

df3
   a  b  d
0  1  2  5
1  1  3  5
2  2  4  6

You could also .set_index('a'), left_on='a' if the column names are the same (as per OP example)

kas
  • 285
  • 1
  • 8