Merging two Excel files and then removing the duplicates that it creates

Question

I've just started using Python, so I could do with some help.

I've merged data from two Excel files using the following code:

# Import pandas library
import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')

#export new dataframe to excel
df.to_excel('WLM module data_test4.xlsx')

This does merge the data, but where dataframe 1 has multiple entries for a module, the merged file repeats the df2 data on each of those rows, so that the df2 side ends up with the same number of entries. Here's an example:

[image: output]

So I want to only have one entry for the moderation of the module, whereas I have two at the moment (highlighted in red).
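The effect can be reproduced with a small made-up example. The column names teaching_hours and moderation_wl and all the values below are assumptions for illustration, not the real spreadsheet data:

```python
import pandas as pd

# Made-up data: df1 has two rows for module M1, df2 has one row per module.
df1 = pd.DataFrame({
    "module_id": ["M1", "M1", "M2"],
    "teaching_hours": [10, 12, 8],
})
df2 = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "moderation_wl": [3, 5],
})

# Same kind of merge as in the question.
merged = df1.merge(df2, on="module_id", how="outer")
print(merged)

# The single M1 row from df2 is matched against both M1 rows from df1,
# so its moderation_wl value appears twice in the merged frame.
```
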

I also want to remove the additional columns "term_y", "semester_y", "credits_y" and "students_y" from the final output, as they just repeat data I already have in df1.

Thanks!

Why don't you merge using a unique ID? I have a feeling this type of merge does not give you the output you want. — NoobVB, Apr 14 '22 at 10:45
I think you want duplicated (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.duplicated.html): if a column has duplicated values, use duplicated to find them and set them to your desired result. Please see my answer below. — David Wooley - AST, Apr 23 '22 at 14:18

score 0 · Answer 1 · answered Apr 14 '22 at 10:59

Could you provide a sample of desired output?

Otherwise, choosing the right type of merge should resolve your issue. Have a look at the documentation, which lists the possible options and their corresponding SQL statements: https://pandas.pydata.org/docs/reference/api/pandas.merge.html

Regarding the additional columns you have two options:

1. Again from the documentation: set the suffixes with the suffixes parameter. To add a suffix only to the duplicate columns from df1, you could set it to something like suffixes=("B2", "").
2. Use df2 within the merge with only the columns needed in the output, e.g. df = df1.merge(df2[['module_id', 'moderator']], on = 'module_id', how='outer')
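Both options can be sketched on made-up frames. The column names term, credits and moderation_wl and the "_mod" suffix are assumptions for illustration; substitute the columns your files actually contain:

```python
import pandas as pd

# Made-up frames that share the key "module_id" and an overlapping "term" column.
df1 = pd.DataFrame({"module_id": ["M1", "M2"], "term": [1, 2], "credits": [15, 30]})
df2 = pd.DataFrame({"module_id": ["M1", "M2"], "term": [1, 2], "moderation_wl": [3, 5]})

# Option 1: choose the suffixes given to overlapping column names.
# Here df1's columns keep their names and df2's overlapping ones get "_mod".
with_suffixes = df1.merge(df2, on="module_id", how="outer", suffixes=("", "_mod"))
print(list(with_suffixes.columns))

# Option 2: pass only the df2 columns you need, so no overlap is created at all.
keep = ["module_id", "moderation_wl"]
subset_merge = df1.merge(df2[keep], on="module_id", how="outer")
print(list(subset_merge.columns))
```

Option 2 avoids having to drop the "_y" columns afterwards, provided you know which df2 columns are new information.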

Thanks for the input, option 2 didn't work. I don't have a column called "moderator". What I'm trying to do is merge two datasets. They both contain information about a set of "modules" but it's different info and there are a different number of rows assigned to each module in each df. So what it is doing is just duplicating the data in df2 rather than just leaving the extra rows blank which is what I want. — spragglerocks, Apr 22 '22 at 14:08

David Wooley - AST · Accepted Answer · 2022-04-23T17:27:27.457

I think what you want is duplicated, as described in these two threads:

Pandas - Replace Duplicates with Nan and Keep Row, and Replace duplicated values with a blank string

So what you want is this after your merge: df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
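As a rough illustration of what that line does, here is a sketch on made-up data (the values are assumptions). duplicated() marks every repeat of a value after its first occurrence, and the .loc assignment blanks only those marked cells while keeping the rows:

```python
import pandas as pd

# duplicated() flags repeats; the first occurrence is kept by default (keep="first").
s = pd.Series(["M1", "M1", "M2", "M1"])
mask = s.duplicated()
print(mask.tolist())

# Applied to a frame: blank the repeated module_id cells, keep every row.
df = pd.DataFrame({
    "module_id": ["M1", "M1", "M2"],
    "lecturer": ["A", "B", "C"],
})
df.loc[df["module_id"].duplicated(), "module_id"] = pd.NA
print(df)
```

Note that the check runs over the whole column, so a value that legitimately appears under two different modules would also be blanked the second time it occurs.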

Please read both linked Stack Overflow threads to better understand how this works.

So the full code would look like this:

import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')

df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA

#export new dataframe to excel
df.to_excel('WLM module data_test5-working.xlsx')

There are many ways to drop columns too.

For lack of time, I've chosen this one:

df.drop(df.columns[2], axis=1, inplace=True)

from https://www.stackvidhya.com/drop-column-in-pandas/

Change df.columns[2] to the position of the column you want to drop (my working data was different from yours*).
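For comparison, here is a sketch of dropping by position and by name on a made-up frame (the column names are assumptions modelled on the "_y" columns in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "term_x": [1, 2],
    "term_y": [1, 2],
    "credits_y": [15, 30],
})

# By position: drop the third column (position 2, counting from 0).
by_position = df.drop(df.columns[2], axis=1)

# By name: list the columns to remove explicitly.
by_name = df.drop(columns=["term_y", "credits_y"])

# By pattern: keep every column whose name does not end in "_y".
no_y = df.loc[:, ~df.columns.str.endswith("_y")]

print(list(by_position.columns))
print(list(by_name.columns))
print(list(no_y.columns))
```

None of these three calls uses inplace=True, so df itself is left unchanged and each result is a new DataFrame.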

Do this after the merge, so that the full code looks like this:

import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')

df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA

#drop the third column by position before exporting
#https://www.stackvidhya.com/drop-column-in-pandas/
df.drop(df.columns[2], axis=1, inplace=True)

#export new dataframe to excel
df.to_excel('WLM module data_test6-working.xlsx')

Hope I've helped.

I'm just very happy I got you somewhere with this, for both our sakes!

Happy you have a working answer.

And if you want to create a new df from the merged, de-duplicated df with the columns dropped, you can do this:

new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)

from Extracting specific selected columns to new DataFrame as a copy

So the full code would look something like this (please adjust the column numbers as you need), which is what I wanted:

# Import pandas library
import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA

new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)

#new=pd.DataFrame(df.drop(df.columns[2], axis=1, inplace=True))

print(new)
#export new dataframe to excel
df.to_excel('WLM module data_test12.xlsx')
new.to_excel('WLM module data_test13.xlsx')

Note: *When I did mine above, I deliberately didn't have any headers in the columns, to try to make it as generic as possible, so I used iloc to specify the column number initially (your original question was not that descriptive or clear, but I got the point). Next time you should include copyable sample data (not screenshots), plus clearer whys and hows, to make it easier for people and to incentivise experts on here to engage with the post. SO isn't a free code-writing service, you know, but it was also hugely to my benefit to delve into this.

And if it answers your question, could you be so kind as to accept the answer? — David Wooley - AST, Apr 23 '22 at 14:20
And for dropping columns, this is a better, quicker resource to start with: https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/ but both are good. — David Wooley - AST, Apr 23 '22 at 14:33
Or this SO thread: https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe — David Wooley - AST, Apr 23 '22 at 14:44
Thanks - I can use what you've written to get what I want! So I now have it working :-) appreciate the help. — spragglerocks, Apr 25 '22 at 09:01
@spragglerocks [Stack Overflow: What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). I'd be grateful if you could tick the check mark to accept the answer. But it was fascinating for me to do anyway. — David Wooley - AST, Apr 25 '22 at 13:28

David Wooley - AST · Answer 3 · 2022-04-24T18:17:10.140

Further to the three working pieces of code in my other answer, each one answering a part of your question, you could do the whole thing using iloc, which is what I prefer.

# Import pandas library
import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')

#apply duplicated() to one of the columns (see more about duplicated in my other answer)
df.loc[df[df.iloc[:0,8].name].duplicated(), 'module_id'] = pd.NA

#drop the columns you don't want and save the result to a new df
new = df.drop(df.iloc[: , [11, 13, 14, 15]], axis=1)

#new=pd.DataFrame(df.drop(df.columns[2], axis=1, inplace=True))

print(new) 

#export new dataframe to excel
df.to_excel('WLM module data_test82.xlsx')
new.to_excel('WLM module data_test83.xlsx') #<--this is your data after dropping the columns. I wanted to do it separately so you can see what is happening and know how to do it; with inplace=True, drop modifies the old df instead.

print(df.iloc[:0,8].name) # to show you how to get the header name from iloc

print(df.iloc[:0,8]) # to show you what iloc gives on its own

"term_y", "semester_y", "credits_y" and "students_y" are columns 12, 14, 15 and 16, the columns you want to remove, so I've done that here.

iloc starts from 0, so new = df.drop(df.iloc[: , [11, 13, 14, 15]], axis=1)

does what you wanted, as in the third piece of code before. All you have to do is change the column numbers it refers to. (If you had given us copyable dummy text replicating your use case instead of a screenshot, we could have copied that to work with, instead of having to write it out ourselves with little time.) Post edit 14:48 24/04/22: I've just done that here for you. Just copy the code and run it.

You have Module (col 3), Module_Id (col 4) and module name (col 13) in your data. In my dummy data, module_id was column 9 (iloc 8); as I said, I didn't have time to replicate your data perfectly, just the idea. But I think it's the module_id column (column 9, iloc 8) that you want not just to merge on, but also to apply .duplicated() to, so you can run the code as is if that's the case.

If it's not, just change the 8 in df.loc[df[df.iloc[:0,8].name].duplicated(), 'module_id'] = pd.NA to 2, 3 or 12 for your use case and columns.
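As a side note, the column label at a given position can also be read straight from df.columns. A small sketch on a made-up frame (the column names and positions are assumptions, not those of the real files):

```python
import pandas as pd

df = pd.DataFrame({
    "module": ["A", "B"],
    "module_id": ["M1", "M2"],
    "term_y": [1, 2],
})

# Two ways to get the label of the column at position 1.
name_via_iloc = df.iloc[:0, 1].name   # empty slice of the column, then its name
name_via_columns = df.columns[1]      # look the label up directly

print(name_via_iloc, name_via_columns)

# The same positional lookup can feed drop(): remove the column at position 2.
new = df.drop(columns=df.columns[[2]])
print(list(new.columns))
```
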

I think I prefer this answer, because knowing (counting) the column numbers frees you from having to call columns by name and allows for a different type of automation. You can still use contains, find or a regex to locate and work with column data later on, but this is another method with its own advantages over relying on names. It's more precise, I feel.

Literally plug in this code, run it and play with it, and let me know how it goes. It all works for me.

score 0 · Answer 4 · answered Apr 26 '22 at 14:39

Thanks everyone for your help, this is my final code which seems to work:

# Import pandas library
import pandas as pd

#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("moderation.xlsx")

#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')

#drop columns not needed
df.drop('term_y', inplace=True, axis=1)
df.drop('semester_y', inplace=True, axis=1)
df.drop('credits_y', inplace=True, axis=1)
df.drop('n_students_y', inplace=True, axis=1)

#blank out repeated values (the rows themselves are kept)
df.loc[df['module_name'].duplicated(), 'module_name'] = pd.NA
df.loc[df['moderation_wl'].duplicated(), 'moderation_wl'] = pd.NA

#export new dataframe to excel
df.to_excel('output.xlsx')
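One caveat with the code above: duplicated() looks at the whole column, so if two different modules happen to share the same moderation_wl value, the second one is blanked as well. A possible alternative, sketched here on made-up data, is to number the rows within each module and merge on that counter too, so the extra df1 rows simply get empty df2 cells. The helper column row_in_module and the values are assumptions for illustration:

```python
import pandas as pd

# Made-up data: M1 has two teaching rows, and both modules share moderation_wl == 3.
df1 = pd.DataFrame({
    "module_id": ["M1", "M1", "M2"],
    "teaching_hours": [10, 12, 8],
})
df2 = pd.DataFrame({
    "module_id": ["M1", "M2"],
    "moderation_wl": [3, 3],
})

# Number the rows within each module: 0 for the first row, 1 for the second, ...
df1["row_in_module"] = df1.groupby("module_id").cumcount()
df2["row_in_module"] = df2.groupby("module_id").cumcount()

# Merge on the module and the counter, so each df2 row pairs with one df1 row only.
merged = (
    df1.merge(df2, on=["module_id", "row_in_module"], how="outer")
       .sort_values(["module_id", "row_in_module"])
       .drop(columns="row_in_module")
       .reset_index(drop=True)
)
print(merged)
```

In this sketch the second M1 row has an empty moderation_wl, while M2 keeps its value even though it equals M1's.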
