Comparing two spreadsheets, removing the duplicates and exporting the result to a csv in python

Question

I'm trying to compare two excel spreadsheets, remove the names that appear in both spreadsheets from the first spreadsheet and then export it to a csv file using python. I am new, but here's what I have so far:

import pandas as pd
data_1 = pd.read_excel (r'names1.xlsx')
bit_data = pd.DataFrame(data_1, columns= ['Full_Name'])
bit_col = len(bit_data)
data_2 = pd.read_excel (r'force_names.xlsx')

force_data = pd.DataFrame(data_2, columns= ['FullName'])

force_col = len(force_data)


for bit_num in range(bit_col):

  for force_num in range(force_col):

        if bit_data.iloc[bit_num,0] == force_data.iloc[force_num,0]:

            data_1 = data_1.drop(data_1.index[[bit_num]])

            data_1.to_csv(r"/Users/name/Desktop/Reports/Names.csv")

When I run it it gets rid of some duplicates but not all, any advice anyone has would be greatly appreciated.

Have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a [mcve] showing sample input dataframes, current output, and expected output — G. Anderson, Jul 01 '21 at 18:17
Also note that pandas has builtin `merge()`, `isin()`, and `drop_duplicates()` functions that could be helpful here — G. Anderson, Jul 01 '21 at 18:18

score 0 · Answer 1 · answered Jul 03 '21 at 01:36

Use pandas merge to get all unique names, with no duplicates. If you want to drop any names that are in both files (I'm not sure if that's what you're asking), you can do so. See this toy example:

row1list = ['G. Anderson']
row2list = ['Z. Ebra']
df1 = pd.DataFrame([row1list, row2list], columns=['FullName'])

row1list = ['G. Anderson']
row2list = ['C. Obra']
df2 = pd.DataFrame([row1list, row2list], columns=['FullName'])

df3 = df1.merge(df2, on='FullName', how='outer', indicator=True)

print(df3)
#       FullName      _merge
# 0  G. Anderson        both
# 1      Z. Ebra   left_only
# 2      C. Obra  right_only

df3 = df3.loc[df3['_merge'] != 'both']
del df3['_merge']

print(df3)
#   FullName
# 1  Z. Ebra
# 2  C. Obra

Comparing two spreadsheets, removing the duplicates and exporting the result to a csv in python

1 Answers1