0

I have 2 pandas data frames which have multiple columns.

Some rows have the same values in every column except one, updated_at.

I need to merge the two data frames and, for matched rows, keep the latest updated_at value. updated_at is a datetime value.

I found a way to merge the data frames, but I'm not sure how to pick the latest value for the updated_at column.

pd.merge(
    df1,
    df2,
    how="inner",
    on=["column_1", "column_2", "column_3"])

Here's sample input for df1 and df2.

df1:

MRN  Encounter ID First Name Last Name  Birth Date       updated_at 
1          1234       John       Doe  01/02/1999  04/12/2002 6:00 PM   
2          2345     Joanne       Lee  04/19/2002  04/19/2002 7:22 PM   
3          3456  Annabelle     Jones  01/02/2001  04/21/2002 5:00 PM

df2:

MRN  Encounter ID First Name Last Name  Birth Date       updated_at 
1          1234       John       Doe  01/02/1999  04/12/2002 5:00 PM   
2          2345     Joanne       Lee  04/19/2002  04/19/2002 8:22 PM

final_output:

MRN  Encounter ID First Name Last Name  Birth Date       updated_at 
1          1234       John       Doe  01/02/1999  04/12/2002 6:00 PM   
2          2345     Joanne       Lee  04/19/2002  04/19/2002 8:22 PM   
3          3456  Annabelle     Jones  01/02/2001  04/21/2002 5:00 PM

Notice that the updated_at column holds the latest value from the matched records.

halfer
Underoos
  • It will depend on what is the `latest` value, right? The rest is just merging and making a decision which one to keep (from df1 or df2). Is it the case where df2 will always have the 'updated_at' considered to be 'latest'? – Danail Petrov Dec 31 '20 at 11:46
  • `df2` doesn't always have the latest `updated_at` value. Sometimes it could be in matched row in `df1`. – Underoos Dec 31 '20 at 11:50
  • I think you'd need to elaborate a bit more on the input and expected output. Otherwise there's a lot of guessing involved. Check this https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Danail Petrov Dec 31 '20 at 11:57
  • Updated question with input and expected output. – Underoos Dec 31 '20 at 12:07
  • Have you seen this solution https://stackoverflow.com/questions/24614474/pandas-merge-on-name-and-closest-date – Danail Petrov Dec 31 '20 at 12:21

3 Answers

2

Here is an example that should accomplish this task. Instead of using a merge, use a concat, then a groupby with an agg, as follows.

A = pd.DataFrame({'Name':['John','Joe'], 'Val':['1','3']})

   Name Val
0  John   1
1   Joe   3

B = pd.DataFrame({'Name':['John','Joe'], 'Val':['4','2']})

   Name Val
0  John   4
1   Joe   2

C = pd.concat([A,B]).groupby('Name').agg('max').reset_index()

   Name Val
0   Joe   3
1  John   4

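In this example, the Val column corresponds to your updated_at column, and the Name column stands in for all the columns you want to match on. Make sure updated_at has a datetime dtype first, because max on strings compares them lexicographically rather than chronologically.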
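As a rough sketch, the same concat/groupby idea applied to the sample rows from the question might look like this. The name key_cols and the format string passed to pd.to_datetime are assumptions made for illustration; adjust them to your real data.

```python
import pandas as pd

# Sample rows copied from the question's df1 and df2 tables.
df1 = pd.DataFrame({
    "MRN": [1, 2, 3],
    "Encounter ID": [1234, 2345, 3456],
    "First Name": ["John", "Joanne", "Annabelle"],
    "Last Name": ["Doe", "Lee", "Jones"],
    "Birth Date": ["01/02/1999", "04/19/2002", "01/02/2001"],
    "updated_at": ["04/12/2002 6:00 PM", "04/19/2002 7:22 PM", "04/21/2002 5:00 PM"],
})
df2 = pd.DataFrame({
    "MRN": [1, 2],
    "Encounter ID": [1234, 2345],
    "First Name": ["John", "Joanne"],
    "Last Name": ["Doe", "Lee"],
    "Birth Date": ["01/02/1999", "04/19/2002"],
    "updated_at": ["04/12/2002 5:00 PM", "04/19/2002 8:22 PM"],
})

# Every column except updated_at identifies a record (hypothetical helper name).
key_cols = [c for c in df1.columns if c != "updated_at"]

# Stack both frames, then parse the timestamps so max() is chronological.
combined = pd.concat([df1, df2], ignore_index=True)
combined["updated_at"] = pd.to_datetime(
    combined["updated_at"], format="%m/%d/%Y %I:%M %p"  # assumed input format
)

# One row per record, keeping the most recent updated_at.
result = combined.groupby(key_cols, as_index=False)["updated_at"].max()
print(result)
```

Rows that exist in only one frame, such as MRN 3 here, survive because concat keeps everything and the group simply has one member.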

jeffery_the_wind
2

Something like this can do the job.

Just make sure your updated_at column has a datetime dtype.
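To illustrate why the dtype matters, here is a small sketch. The two timestamps are hypothetical values picked to expose the problem, and the format string is an assumption based on how the question's sample data is written.

```python
import pandas as pd

# Hypothetical values, not taken from the question.
raw = pd.Series(["04/12/2002 9:00 AM", "04/12/2002 10:00 PM"])

# As text, comparison is character by character, so "9" beats "1".
latest_as_text = raw.max()

# After parsing, comparison is chronological.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M %p")  # assumed format
latest_as_datetime = parsed.max()

print(latest_as_text)      # the 9:00 AM string wins as text
print(latest_as_datetime)  # the 10:00 PM timestamp wins as datetime
```

Sorting by updated_at has the same pitfall, so the conversion should happen before the sort_values call.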

>>> pd.concat([df1,df2]).sort_values('updated_at').drop_duplicates(subset=df1.columns[:-1],keep='last').sort_values('MRN')
    MRN Encounter_ID First_Name   Last_Name  Birth_Date          updated_at
1  1234         John        Doe  01/02/1999  04/12/2002 2020-12-31 06:00:00
2  2345       Joanne        Lee  04/19/2002  04/19/2002 2020-12-31 08:22:00
3  3456    Annabelle      Jones  01/02/2001  04/21/2002 2020-12-31 05:00:00
Danail Petrov
1

If you want to include rows from one dataframe that are not present in the other, an inner merge won't work; you'll need how='outer':

df = df1.merge(df2, how='outer', on=df1.columns[:-1].tolist())

Then you can get the last update with:

df['updated_at'] = df[['updated_at_x', 'updated_at_y']].max(axis=1)

and drop the unnecessary columns:

df = df.drop(columns=['updated_at_x', 'updated_at_y'])

Output:

   MRN  Encounter ID First Name Last Name  Birth Date          updated_at
0    1          1234       John       Doe  01/02/1999 2002-04-12 18:00:00
1    2          2345     Joanne       Lee  04/19/2002 2002-04-19 20:22:00
2    3          3456  Annabelle     Jones  01/02/2001 2002-04-21 17:00:00
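For reference, a self-contained sketch of the outer-merge approach follows. It is trimmed to two key columns for brevity, and the names keys and fmt as well as the custom suffixes are assumptions for illustration. It also shows that a record present in only one frame gets NaT on the other side, which max(axis=1) skips by default.

```python
import pandas as pd

fmt = "%m/%d/%Y %I:%M %p"  # assumed timestamp format from the question's sample
keys = ["MRN", "Encounter ID"]  # trimmed key set, hypothetical name

df1 = pd.DataFrame({
    "MRN": [1, 2, 3],
    "Encounter ID": [1234, 2345, 3456],
    "updated_at": pd.to_datetime(
        ["04/12/2002 6:00 PM", "04/19/2002 7:22 PM", "04/21/2002 5:00 PM"], format=fmt
    ),
})
df2 = pd.DataFrame({
    "MRN": [1, 2],
    "Encounter ID": [1234, 2345],
    "updated_at": pd.to_datetime(
        ["04/12/2002 5:00 PM", "04/19/2002 8:22 PM"], format=fmt
    ),
})

# Outer merge keeps MRN 3 even though df2 has no matching row.
merged = df1.merge(df2, how="outer", on=keys, suffixes=("_df1", "_df2"))
stamp_cols = ["updated_at_df1", "updated_at_df2"]

# Row-wise max ignores NaT, so unmatched rows keep their only timestamp.
merged["updated_at"] = merged[stamp_cols].max(axis=1)
merged = merged.drop(columns=stamp_cols).sort_values("MRN").reset_index(drop=True)
print(merged)
```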
Cainã Max Couto-Silva