Comparing multiple columns of two csv files and save output as matching/not matching in new csv file

Question

Suppose I have the following columns in file1.csv:

Customer id    Name 

Q1             Alen
W2             Ricky
E3             Katrina
R4             Anya
T5             Leonardo

and the following columns in file2.csv:

Customer id    Name

Q1             Alen
W2             Harry
E3             Katrina
R4             Anya
T5             Leonardo

As you can see, for Customer id W2 the corresponding name does not match, so output.csv should look like this:

Customer id  Status

Q1           Matching
W2           Not matching
E3           Matching
R4           Matching
T5           Matching

How can I get the above output using Python?

P.S. What would the code look like for comparing multiple columns, not just the Name column?

My code

import csv
with open('file1.csv', 'rt', encoding='utf-8') as csvfile1:
    csvfile1_indices = dict((r[1], i) for i, r in enumerate(csv.reader(csvfile1)))

with open('file2.csv', 'rt', encoding='utf-8') as csvfile2:
    with open('output.csv', 'w') as results:    
        reader = csv.reader(csvfile2)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['status'])

        for row in reader:
            index = csvfile1_indices.get(row[1])
            if index is not None:
                message = 'matching'
                writer.writerow(row + [message])

            else:
                message = 'not matching'
                writer.writerow(row + [message])

This works fine, but is there an easier way to get the same output? And what changes do I need to make to compare multiple columns?

What have you tried so far? How about just using a string comparison tool like WinMerge? — Circle Hsiao, Nov 05 '18 at 09:50
Similar questions [here](https://stackoverflow.com/questions/48693547/comparing-two-csv-files-and-get-the-difference-using-python) and [here](https://stackoverflow.com/questions/41967523/trying-to-compare-two-csv-files-and-write-differences-as-output) and [here](https://stackoverflow.com/questions/38996033/python-compare-two-csv-files-and-print-out-differences) — sunny chidi, Nov 05 '18 at 10:05
@蕭為元 You can see the code I tried. I've edited the question. — pytorch, Nov 05 '18 at 10:08

Siliam · Accepted Answer · 2018-11-06T09:41:13.603

2

If you don't mind using pandas, you can do it in five lines of code:

import pandas as pd 

# assuming id columns are identical and contain the same values
df1 = pd.read_csv('file1.csv', index_col='Customer_id')
df2 = pd.read_csv('file2.csv', index_col='Customer_id')

df3 = pd.DataFrame(columns=['status'], index=df1.index)
df3['status'] = (df1['Name'] == df2['Name']).replace([True, False], ['Matching', 'Not Matching'])

df3.to_csv('output.csv')

Edit: removed sep='\t' to use the default comma separator.
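For the P.S. about comparing several columns, the same idea can be extended. The following is only a sketch, not part of the original answer: the helper name compare_columns and the extra column "City" are made up for illustration, and it assumes both frames are indexed by unique customer IDs (for example via index_col in read_csv). Note that a value missing in both files counts as "Not matching" here, because NaN never compares equal to NaN.

```python
import pandas as pd


def compare_columns(df1, df2, columns):
    """Return a one-column DataFrame with a status per customer ID.

    Assumption: df1 and df2 are both indexed by unique customer IDs.
    """
    left = df1[columns]
    # Align df2 to df1's index so rows are compared by ID, not by position.
    # IDs missing from df2 become NaN rows and end up as "Not matching".
    right = df2[columns].reindex(df1.index)
    # A row matches only if every compared column is equal.
    same = (left == right).all(axis=1)
    status = same.map({True: "Matching", False: "Not matching"})
    return status.to_frame(name="Status")


# Hypothetical in-memory data standing in for file1.csv and file2.csv.
df1 = pd.DataFrame(
    {"Customer_id": ["Q1", "W2"], "Name": ["Alen", "Ricky"], "City": ["A", "B"]}
).set_index("Customer_id")
df2 = pd.DataFrame(
    {"Customer_id": ["Q1", "W2"], "Name": ["Alen", "Harry"], "City": ["A", "B"]}
).set_index("Customer_id")

result = compare_columns(df1, df2, ["Name", "City"])
# result.to_csv('output.csv') would then write the ID and Status columns.
```

Passing a single-element list such as ["Name"] gives the one-column comparison from the question.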

edited Nov 06 '18 at 09:41

answered Nov 05 '18 at 10:16


I got ValueError: Index Customer_id invalid, but changing sep='\t' to sep=',' solved the error – pytorch Nov 05 '18 at 11:12
Sorry, my bad! You can actually omit the separator argument altogether if you're using comma-separated values. – Siliam Nov 06 '18 at 09:42

score 0 · Answer 2 · answered Nov 05 '18 at 09:49


Read both CSV files into two separate dictionaries keyed by customer ID, then iterate over one of them and look up each key in the other. If you need to preserve order, use OrderedDict.
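A minimal sketch of this approach with only the standard library might look like the following. The function names load_rows and write_status are hypothetical, and the sketch assumes comma-separated files with a header row whose ID column is named "Customer id" (pass key="Customer_id" if your header uses an underscore). Since Python 3.7 a plain dict keeps insertion order, so OrderedDict is not strictly needed.

```python
import csv


def load_rows(path, key="Customer id"):
    """Map each customer ID to its row (a dict of column name -> value)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}


def write_status(path1, path2, out_path, key="Customer id", columns=("Name",)):
    """Write one 'Matching' / 'Not matching' line per ID found in path1."""
    rows1 = load_rows(path1, key)
    rows2 = load_rows(path2, key)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([key, "Status"])
        for cust_id, row1 in rows1.items():
            row2 = rows2.get(cust_id)
            # An ID missing from the second file counts as not matching.
            same = row2 is not None and all(row1[c] == row2[c] for c in columns)
            writer.writerow([cust_id, "Matching" if same else "Not matching"])
```

Calling write_status('file1.csv', 'file2.csv', 'output.csv') compares the Name column; to compare more columns, pass their header names in columns.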

answered Nov 05 '18 at 09:49

Sanjay Idpuganti


Python script? @Sanjay Idpuganti – pytorch Nov 05 '18 at 10:02

Alex · Answer 3 · 2018-11-05T11:40:23.210

You can merge on multiple columns:

f1
  Customer_id      Name
0          Q1      Alen
1          W2     Ricky
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo

f2
  Customer_id      Name
0          Q1      Alen
1          W2     Harry
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo

m = f1.merge(f2, on=['Customer_id', 'Name'], indicator='Status', how='outer')
  Customer_id      Name      Status
0          Q1      Alen        both
1          W2     Ricky   left_only
2          E3   Katrina        both
3          R4      Anya        both
4          T5  Leonardo        both
5          W2     Harry  right_only

m['Status'] = m['Status'].map({'both': 'Matching', 
                               'left_only': 'Not matching', 
                               'right_only': 'Not matching'})

# drop_duplicates and drop return new frames, so the result must be assigned
m = m.drop_duplicates(subset=['Customer_id', 'Status']).drop(['Name'], axis=1)
  Customer_id        Status
0          Q1      Matching
1          W2  Not matching
2          E3      Matching
3          R4      Matching
4          T5      Matching
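The merge idea can also be wrapped in one function. This sketch is a variant, not the code above: it uses a left merge instead of an outer merge, so the rows of f1 keep their order and there are no right_only rows to clean up afterwards. The name merge_status is made up, and the sketch assumes each customer ID appears only once in f1.

```python
import pandas as pd


def merge_status(f1, f2, key="Customer_id", columns=("Name",)):
    """Status per row of f1: 'Matching' if f2 has the same key and column values."""
    on = [key, *columns]
    # The left merge keeps f1's rows and order; the indicator column records
    # whether the full (key + columns) combination was also found in f2.
    # Duplicate combinations in f2 are dropped so f1's rows are not multiplied.
    merged = f1[on].merge(
        f2[on].drop_duplicates(), on=on, how="left", indicator="Status"
    )
    merged["Status"] = (merged["Status"] == "both").map(
        {True: "Matching", False: "Not matching"}
    )
    return merged[[key, "Status"]]
```

For example, merge_status(f1, f2).to_csv('output.csv', index=False) writes the two-column result; adding more names to columns compares those columns as well.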

@Alex, this will not produce the desired output; otherwise it's easy :) — Karn Kumar, Nov 05 '18 at 10:37
@pytorch I have updated the code a little bit to make it shorter/easier to maintain. — Alex, Nov 05 '18 at 11:19
