I have two dataframes with overlapping columns but identical indexes and I want to combine them. I feel like this should be straight forward but I have worked through sooo many examples and SO questions and it's not working but also seems to not be consistent with other examples.
import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]}, index = ['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]}, index = ['a', 'b', 'c', 'd', 'e'])
In [1]: df
Out[1]:
gen1 gen2 gen3
a 1 0 1
b 0 1 0
c 0 1 0
d 1 1 1
e 1 1 0
In [2]: df1
Out[2]:
gen1 gen3 gen4
a 1 1 0
b 0 0 1
c 0 0 1
d 1 1 0
e 1 0 1
After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html) I'm convinced I have found the correct example (the first and second example of the merges). The second example is this:
In [43]: result = pd.merge(left, right, on=['key1', 'key2'])
In their example they have two DFs (left and right) that have overlapping columns and identical indexs and their resulting dataframe has one version of each column and the original indexs but this is not what happens when I do that:
# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1))
In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]:
gen1 gen2 gen3 gen4
0 1 0 1 0
1 1 0 1 0
2 1 1 1 0
3 1 1 1 0
4 0 1 0 1
5 0 1 0 1
6 0 1 0 1
7 0 1 0 1
8 1 1 0 1
Here we see that merge has not seen that the indexs are the same! I have fiddled around with the options but cannot get the result I want.
A similar but different question was asked here How to keep index when using pandas merge but I don't really understand the answers and so can't relate it to my problem.
Points for this specific example:
- Index will always be identical.
- Columns with the same name will always have identical entries (i.e. they are duplicates).
It would be great to have a solution for this specific problem but I would also really like to understand it because I find myself spending lots of time on combining dataframes from time to time. I love pandas and in general I find it very intuitive but I just can't seem to get comfortable with anything other than trivial combinations of dataframes.