Python pandas merge with condition and no duplicates

Question

I have 2 dataframes derived from 2 excel files. The first is a sort of template where there is a column with a condition and the other has the same format but includes inputs for different time periods. I would like to create an output dataframe that basically creates a copy of the template populated with the inputs when the condition is met.

When I use something like df1.merge(df2.assign(Condition='yes'), on=['Condition'], how='left') I sort of get something in line with what I'm after but it contains duplicates. What could I do instead?

thanks

Example below

Code

df1={'reference':[1,2],'condition':['yes','no'],'31/12/2021':[0,0],'31/01/2022':[0,0]}
df1 = pd.DataFrame.from_dict(df1)

df2 = {'reference':[1,2],'condition':["",""],'31/12/2021':[101,231],'31/01/2022':[3423,3242]}
df2 = pd.DataFrame.from_dict(df2)

df1.merge(df2.assign(condition='yes'), on=['condition'], how='left')

Visual example

what is the expected result? why not merging on the reference instead? you can drop using drop_duplicates and specify the columns to evaluates for duplicates in a subset attribute — Naveed, Sep 25 '22 at 16:00
Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — BeRT2me, Sep 25 '22 at 17:25

ouroboros1 · Accepted Answer · 2022-09-25T16:12:29.947

1

You could use df.update for this:

# only `update` from column index `2` onwards: ['31/12/2021', '31/01/2022']
df2.update(df1.loc[df1.condition=='no', list(df1.columns)[2:]])
print(df2)

   reference condition  31/12/2021  31/01/2022
0          1                 101.0      3423.0
1          2                   0.0         0.0

Alternative solution using df.where:

df2.iloc[:,2:] = df2.iloc[:,2:].where(df1.condition=='yes',df1.iloc[:,2:])
print(df2)

   reference condition  31/12/2021  31/01/2022
0          1                   101        3423
1          2                     0           0

edited Sep 25 '22 at 16:12

answered Sep 25 '22 at 16:07

ouroboros1

9,113
3
7
26

Thanks @ourobotos1, how can I just update for the data range (e.g. not the headers)? – Tom Sep 25 '22 at 17:42
Not sure if I understand the question. Given your "visual example", doesn't `df2` line up with the desired result (`df_output`)? We're only updating the values for `31/12/2021` and `31/01/2022` in row `0`. Where do you see an update for the headers? – ouroboros1 Sep 25 '22 at 17:52
Thanks. It does, it's just that the actual problem that I'm trying to solve is given an excel template and an input file (with the same format) how to update the template file without changing the dates format. The options that I see are 1) just copy a range where the data is into the template or 2) update the format of the dates before saving back to excel, what do you suggest? – Tom Sep 25 '22 at 18:13
This is really a different question, and there are (as so often) different ways to do this. I would maybe try as follows. Assumption: "reference" is value for cell `A1` in sheet `Sheet1` for a wb `test.xlsx`. Code: `import openpyxl` \n `wb = openpyxl.load_workbook('test.xlsx')` \n `s = wb['Sheet1']` \n `for r in range(df2.shape[0]):` \n *indent* `for c in range(1, df2.shape[1]):` \n *2x indent* `s.cell(row=r+2, column=c+1).value = df2.iloc[r,c]` \n *cut indents again* `wb.save('test.xlsx')` (This is assuming `df2` as at the end in my answer, of course. – ouroboros1 Sep 25 '22 at 18:48

Python pandas merge with condition and no duplicates

1 Answers1