
I would like to merge two DataFrames on the index (hence join()).
But the two DataFrames have about 20 columns, exactly the same, and I would like to avoid ending up with duplicated columns: deciding which ones to keep, dropping half of them and renaming the others would be cumbersome.

My aim is to make one DataFrame (I'm calling it "old") that serves as a sort of history of all previous DataFrames. So I'm building a new DataFrame and then merging it into its older self, over and over; the old DataFrame will grow with each iteration.

Here is a simplified example:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]}
                  ).set_index([[11, 22, 33]])

old = df.head(2)
new = df.tail(2)

print( old.join(new,how='outer', lsuffix='_left') )

Which gives :

    A_left  B_left  C_left    A    B    C
11     1.0     4.0     7.0  NaN  NaN  NaN
22     2.0     5.0     8.0  2.0  5.0  8.0
33     NaN     NaN     NaN  3.0  6.0  9.0
  • 11: if the ID doesn't exist in new, the old row should just be kept as-is, not turned into duplicated columns full of NaNs.

  • 22: if the ID exists in both, the old values should be overwritten; discard the _left columns and keep the right-hand ones.

  • 33: if the ID doesn't exist in old but does in new, just append it.
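
Putting those three rules together, the expected result for this toy example would simply be the full original frame again:

    A  B  C
11  1  4  7
22  2  5  8
33  3  6  9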

I have searched the docs quite a lot for this but couldn't find anything.

My best idea so far is to do this join with suffixes and then apply filters: where A, B, C are NaN, copy the values over from A_left, B_left, C_left, then delete the A_left, B_left, C_left columns, etc. (a rough sketch is below).
This doesn't seem like an efficient solution, though.
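
Roughly, that filter idea would look something like this (columns hard-coded just for this small example):

joined = old.join(new, how='outer', lsuffix='_left')
for col in ['A', 'B', 'C']:
    joined[col] = joined[col].fillna(joined[col + '_left'])
joined = joined.drop(columns=[c for c in joined.columns if c.endswith('_left')])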

Or maybe append them, sort_values, and then remove duplicate IDs (a sketch of that idea follows as well)?
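
A sketch of that append/deduplicate idea, dropping duplicates by index (rather than by value) so that the row from new wins:

combined = pd.concat([old, new])
combined = combined[~combined.index.duplicated(keep='last')].sort_index()

Note this takes whole rows from new; it doesn't merge values cell by cell.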

As I'm new to Python, this might not be the best approach; let me know.

stallingOne
  • You might just be looking for `new.combine_first(old)`; am I correct in thinking that if new's 22/C were 99, you'd want 99 in the output? – DSM Jul 16 '18 at 13:14
  • @DSM yes you are correct. Also, if new/22/B was NaN and new/22/C was 99, I would expect 5 and 99 respectively. Would combine_first still work? – stallingOne Jul 16 '18 at 13:30
  • yes. See [here](https://stackoverflow.com/questions/9787853/join-or-merge-with-overwrite-in-pandas). – DSM Jul 16 '18 at 14:14
  • @DSM Your answer is just perfect! Exactly what I needed and so small it's incredible. Thanks! You can make an answer of it and I'll mark it as solved. – stallingOne Jul 18 '18 at 08:23
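
For reference, this is what the `combine_first` suggestion from the comments gives on the example frames above (values from new take priority, old only fills the gaps):

print( new.combine_first(old) )

      A    B    C
11  1.0  4.0  7.0
22  2.0  5.0  8.0
33  3.0  6.0  9.0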

2 Answers


------------------Edit after comments-------------------------------

First option, full code: it keeps the indices from both DataFrames, while rows that share an index are updated with the values from new.

import pandas as pd

old = pd.DataFrame({'A': [2, 3, 4],
                    'B': [5, 6, 4],
                    'C': [8, 9, 4]}
                   ).set_index([[22, 33, 44]])

new = pd.DataFrame({'A': [1, 2, 3],
                    'B': [44, 55, 66],
                    'C': [7, 8, 9]}
                   ).set_index([[11, 22, 33]])

new
    A   B   C
11  1   44  7
22  2   55  8
33  3   66  9

old
    A   B   C
22  2   5   8
33  3   6   9
44  4   4   4

pd.merge(new, old, on=['A','B','C'], how='outer', right_index=True, left_index=True)

output:
    A   B   C
11  1   44  7
22  2   55  8
33  3   66  9
44  4   4   4

Have you tried merge?

    pd.merge(old, new, on=['A','B','C'], how='outer', left_index=True, right_index=True)

Output:

        A   B   C
    11  1   4   7
    22  2   5   8
    33  3   6   9

Option 2: Use append and drop duplicates:

new.append(old).drop_duplicates()
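
(In newer pandas versions `DataFrame.append` has been removed; the equivalent of that line would be `pd.concat([new, old]).drop_duplicates()`.)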
Maria
  • Yes I tried. Syntax is different but it comes to the same if you specify `how='outer', suffixes=['_l', '_r'], left_index=True, right_index=True`. Your example here (`pd.merge(old, new, on=['A','B','C'], how='outer')`) doesn't use indexes but all columns, that's not really the same. – stallingOne Jul 16 '18 at 13:02
  • I edited it to use the index of each df. Do the values in ABC map to an index (one to one)? – Maria Jul 16 '18 at 13:11
  • EDIT: It will not map 1to1 in all cases, since first "old" might have more or less info than "new" in certain columns. – stallingOne Jul 16 '18 at 13:16
  • The problem is that it seems to keep the left (old) values if they exist (not NaN, I suppose?), regardless of whether they exist in the right (new) df. – stallingOne Jul 16 '18 at 13:17
  • So, you wish to have a new df containing only indices present in new, while updating values from old to new if the index is mutual? – Maria Jul 16 '18 at 13:44
  • The first statement is false: I wish to keep indices present in **both** (`how='outer'`). The second statement is correct (updating values from old to new if the index is mutual). Also: updating NaNs if a value is present in either df. I think your option 2 is best, with a sort on index and then on values between append() and drop_duplicates(). Will try it. – stallingOne Jul 16 '18 at 14:01
  • I think append and drop won't work for you, since it doesn't update old with new. Check out the edit I made. – Maria Jul 16 '18 at 14:40

You can try this, I think it will work for you!

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [4, 5, 6, 7],
                   'C': [7, 8, 9, 10],
                   'D': [10, 11, 12, 14]}
                  ).set_index([[11, 22, 33, 44]])

df2 = pd.DataFrame({'A': [1, 2, 3, 4],
                    'B': [4, 5, 6, 8],
                    'C': [11, 12, 13, 15],
                    'D': [14, 15, 16, 17]}
                   ).set_index([[11, 22, 33, 44]])
old = df.head(3).copy()   # .copy() so that old can be modified independently of df
new = df2.tail(3)

# indices present in both frames: overwrite old's rows with the values from new
intersection = list(set(new.index).intersection(old.index))
old.loc[intersection] = new.loc[intersection]

# indices present only in new: add those rows to old via .loc enlargement
only_new = [x for x in new.index if x not in old.index]
old.loc[only_new] = new.loc[only_new]
kaihami