0

I want to join two dataframe (i.e df_1, df_2) based on date so that a new df_1_2 formed with column as Date, A, B, C where all dates with their resppective values are present without any duplicate value

Current code:

import pandas as pd

# create dictionary

dict1 = {"Date":["2000-01-01", "2000-01-04", "2000-01-05", "2000-01-07"], "A":[99, 93,100,97], "B": [106,107,109,105]}
dict2 = {"Date":["2000-01-01", "2000-01-03", "2000-01-05", "2000-01-07"], "A":[99, 96,100,97], "B": [106,100,109,105], "C":[2,5,8,4]}

# create dataframe using dict1

df_1 = pd.DataFrame(dict1)
df_1["Date"] = pd.to_datetime(df_1["Date"])
df_1.set_index("Date", inplace = True)

# create dataframe using dict2

df_2 = pd.DataFrame(dict2)
df_2["Date"] = pd.to_datetime(df_2["Date"])
df_2.set_index("Date", inplace = True)

# concat df_1 & df_2

df_1_2 = pd.concat([df_1, df_2])
print(df_1_2)

Expected output:

enter image description here

Michael S.
  • 3,050
  • 4
  • 19
  • 34
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Aug 03 '23 at 17:29
  • 2
    try checking `combine_first` function. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine_first.html – Emma Aug 03 '23 at 17:34
  • Does this answer your question? [join or merge with overwrite in pandas](https://stackoverflow.com/questions/9787853/join-or-merge-with-overwrite-in-pandas) – OCa Aug 03 '23 at 18:06

2 Answers2

1

Emma's solution of using combine_first is the most straight forward and most likely the best:

df_1.combine_first(df_2)

You can also look into merge. Is there a reason you have to set Date as the index in the first portion? Regardless, if you reset the indexes, merge them on Date, A, and B via outer join and then set the index back to Date, you can get the same result as combine_first. Just an alternative for more options:

(df_1.reset_index().merge(df_2.reset_index(), on = ['Date', 'A', 'B'], how = 'outer')
 .set_index('Date')
 .sort_index())

Output for both:

             A   B   C
Date            
2000-01-01  99  106 2.0
2000-01-03  96  100 5.0
2000-01-04  93  107 NaN
2000-01-05  100 109 8.0
2000-01-07  97  105 4.0
Michael S.
  • 3,050
  • 4
  • 19
  • 34
0

Do I get this right?

  • By duplicates, you mean identical rows regardless 'C'
  • In case of such duplicate, you would like to keep the row from df_2 because it has extra information ('C')

If that is correct, because those are not duplicates over all columns,

  • do the concatenation as you did df_1_2 = pd.concat([df_1, df_2])
  • then handle your duplicates (mark rows, then remove)
# Make sure rows from df_1_2 are place on top
df_1_2.sort_values(by='C', inplace=True)

# Mark duplicates with regards to 'A' and 'B'
df_1_2['duplicate'] = df_1_2.drop(['C'], axis=1).duplicated()

Intermediate output:

              A    B    C  duplicate
Date                                
2000-01-01   99  106  2.0      False
2000-01-07   97  105  4.0      False
2000-01-03   96  100  5.0      False
2000-01-05  100  109  8.0      False
2000-01-01   99  106  NaN       True
2000-01-04   93  107  NaN      False
2000-01-05  100  109  NaN       True
2000-01-07   97  105  NaN       True
# Remove duplicates, drop 'duplicate' temporary column, then restore order by date (index). 
df_1_2.loc[~df_1_2.duplicate].drop('duplicate', axis=1).sort_index()

Output:

              A    B    C
Date                     
2000-01-01   99  106  2.0
2000-01-03   96  100  5.0
2000-01-04   93  107  NaN
2000-01-05  100  109  8.0
2000-01-07   97  105  4.0
OCa
  • 298
  • 2
  • 13