2

I have two dataframes with overlapping columns but identical indexes and I want to combine them. I feel like this should be straight forward but I have worked through sooo many examples and SO questions and it's not working but also seems to not be consistent with other examples.

import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]}, index = ['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]}, index = ['a', 'b', 'c', 'd', 'e'])

In [1]: df
Out[1]: 
   gen1  gen2  gen3
a     1     0     1
b     0     1     0
c     0     1     0
d     1     1     1
e     1     1     0

In [2]: df1
Out[2]: 
   gen1  gen3  gen4
a     1     1     0
b     0     0     1
c     0     0     1
d     1     1     0
e     1     0     1

After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html) I'm convinced I have found the correct example (the first and second example of the merges). The second example is this:

In [43]: result = pd.merge(left, right, on=['key1', 'key2'])

In their example they have two DFs (left and right) that have overlapping columns and identical indexs and their resulting dataframe has one version of each column and the original indexs but this is not what happens when I do that:

# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1))

In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]: 
   gen1  gen2  gen3  gen4
0     1     0     1     0
1     1     0     1     0
2     1     1     1     0
3     1     1     1     0
4     0     1     0     1
5     0     1     0     1
6     0     1     0     1
7     0     1     0     1
8     1     1     0     1

Here we see that merge has not seen that the indexs are the same! I have fiddled around with the options but cannot get the result I want.

A similar but different question was asked here How to keep index when using pandas merge but I don't really understand the answers and so can't relate it to my problem.

Points for this specific example:

  • Index will always be identical.
  • Columns with the same name will always have identical entries (i.e. they are duplicates).

It would be great to have a solution for this specific problem but I would also really like to understand it because I find myself spending lots of time on combining dataframes from time to time. I love pandas and in general I find it very intuitive but I just can't seem to get comfortable with anything other than trivial combinations of dataframes.

cs95
  • 379,657
  • 97
  • 704
  • 746
ojunk
  • 879
  • 8
  • 21

2 Answers2

3

Starting v0.23, you can specify an index name for the join key, if you have it.

df.index.name = df1.index.name = 'idx'
df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))

     gen1  gen3  gen4  gen2
idx                        
a       1     1     0     0
b       0     0     1     1
c       0     0     1     1
d       1     1     0     1
e       1     0     1     1 

The assumption here is that your actual DataFrame does not have exactly the same values in overlapping columns. If they did, then your question would be one of concatenation— you can use pd.concat for that:

c = list(set(df).intersection(set(df1)))
pd.concat([df1, df.drop(c, 1)], axis=1)

     gen1  gen2  gen3  gen4                     
a       1     0     1     0
b       0     1     0     1
c       0     1     0     1
d       1     1     1     0
e       1     1     0     1
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Thanks @coldspeed but your line `df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))` returns `KeyError: 'idx'`. The indexs are both named 'idx' is it perhaps that `df['idx']` will return the same error? – ojunk Dec 07 '18 at 12:52
  • @ojunk actually I think your pandas may need to be updated, this works on the latest version only. – cs95 Dec 07 '18 at 16:52
  • yes you're right - sorry you did say in your answer I just forgot. Thanks! – ojunk Dec 07 '18 at 17:11
2

In this special case, you can use assign
Things in df take priority but all other things in df1 are included.

df1.assign(**df)

   gen1  gen2  gen3  gen4
a     1     0     1     0
b     0     1     0     1
c     0     1     0     1
d     1     1     1     0
e     1     1     0     1

**df unpacks df assuming a dictionary context. This unpacking delivers keyword arguments to assign with the names of columns as the keyword and the column as the argument.

It is the same as

df1.assign(gen1=df.gen1, gen3=df.gen3, gen4=df.gen4)
piRSquared
  • 285,575
  • 57
  • 475
  • 624