I would like to merge two DataFrames on the index (hence join()).
But the two DataFrames have about 20 columns, exactly the same.
And I would like to avoid ending up with duplicated columns: deciding which to keep, dropping half of them, and renaming the rest would be cumbersome.
My aim is to keep one DataFrame (I'm calling it "old") as a sort of history of all previous DataFrames. So I build a new DataFrame, merge it into its older self, and so on. The old DataFrame will grow over the iterations.
Here is a simplified example:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]}).set_index([[11, 22, 33]])
old = df.head(2)
new = df.tail(2)
print(old.join(new, how='outer', lsuffix='_left'))
Which gives:
A_left B_left C_left A B C
11 1.0 4.0 7.0 NaN NaN NaN
22 2.0 5.0 8.0 2.0 5.0 8.0
33 NaN NaN NaN 3.0 6.0 9.0
11: if the ID doesn't exist in new, the old row should just be kept, without creating duplicate columns full of NaNs.
22: if the ID exists in both, the old values should be overwritten; discard the _left columns and keep new's values.
33: if the ID doesn't exist in old but does in new, just append it.
I have searched the docs quite a lot for this but couldn't find anything.
My best idea so far is to do this join with suffixes and then apply filters: where columns A_left, B_left, C_left are NaN, copy the values from A, B, C; then delete the _left columns, and so on.
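To be concrete, here is roughly what I mean by that filter idea, reusing the example above (just a sketch, I'm not sure it's idiomatic):

```python
import pandas as pd

# rebuild the example
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]}).set_index([[11, 22, 33]])
old = df.head(2)
new = df.tail(2)

joined = old.join(new, how='outer', lsuffix='_left')
for col in df.columns:
    # where the new value is missing, fall back to the old (_left) value
    joined[col] = joined[col].fillna(joined[col + '_left'])
result = joined[df.columns.tolist()]  # keep only A, B, C; drop the _left columns
```

It works on the toy data, but it loops over every column and leaves the values as floats because of the intermediate NaNs.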
This doesn't seem like a good efficient solution.
Or maybe append them, sort_values, and then remove duplicate IDs?
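Something like this, I suppose (again a sketch on the example above; I'm using index.duplicated with keep='last' so that new wins on overlapping IDs):

```python
import pandas as pd

# rebuild the example
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]}).set_index([[11, 22, 33]])
old = df.head(2)
new = df.tail(2)

combined = pd.concat([old, new])  # stack old on top of new
# keep the last occurrence of each index value, i.e. new overwrites old
result = combined[~combined.index.duplicated(keep='last')].sort_index()
```

This one at least keeps the integer dtypes, since no NaNs are introduced.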
As I'm new to Python, this might not be the best approach; let me know.