Suppose I have these columns in file1.csv:

Customer id    Name 

Q1             Alen
W2             Ricky
E3             Katrina
R4             Anya
T5             Leonardo

and these columns in file2.csv:

Customer id    Name

Q1             Alen
W2             Harry
E3             Katrina
R4             Anya
T5             Leonardo

As you can see, for Customer id W2 the corresponding names do not match, so output.csv should look like this:

Customer id  Status

Q1           Matching
W2           Not matching
E3           Matching
R4           Matching
T5           Matching

How can I get the above output using Python?

P.S. What is the code for comparing multiple columns, not just the Name column?

My code:

import csv

# Map each Customer id in file1.csv to its Name.
with open('file1.csv', 'rt', encoding='utf-8', newline='') as csvfile1:
    reader1 = csv.reader(csvfile1)
    next(reader1, None)  # skip the header row
    file1_names = {r[0]: r[1] for r in reader1 if r}

with open('file2.csv', 'rt', encoding='utf-8', newline='') as csvfile2, \
        open('output.csv', 'w', encoding='utf-8', newline='') as results:
    reader = csv.reader(csvfile2)
    writer = csv.writer(results)

    writer.writerow(next(reader, []) + ['status'])

    for row in reader:
        if not row:
            continue  # skip blank lines
        # A row matches when file1 has the same name for this Customer id.
        message = 'matching' if file1_names.get(row[0]) == row[1] else 'not matching'
        writer.writerow(row + [message])

This works fine, but is there an easier way to get the same output? And what changes do I need to make to compare multiple columns?

Sreeram TP
pytorch
  • What have you tried so far? How about just using a string comparison tool like WinMerge? – Circle Hsiao Nov 05 '18 at 09:50
  • Similar questions [here](https://stackoverflow.com/questions/48693547/comparing-two-csv-files-and-get-the-difference-using-python), [here](https://stackoverflow.com/questions/41967523/trying-to-compare-two-csv-files-and-write-differences-as-output), and [here](https://stackoverflow.com/questions/38996033/python-compare-two-csv-files-and-print-out-differences) – sunny chidi Nov 05 '18 at 10:05
  • @蕭為元 You can see the code I tried. I've edited the question. – pytorch Nov 05 '18 at 10:08
  • Can you use Pandas? – Sreeram TP Nov 05 '18 at 10:32
  • @Sreeram Yes, of course. – pytorch Nov 05 '18 at 10:47

3 Answers

If you don't mind using Pandas, you can do it in a few lines of code:

import pandas as pd

# Assumes both files contain the same Customer_id values in the same order
df1 = pd.read_csv('file1.csv', index_col='Customer_id')
df2 = pd.read_csv('file2.csv', index_col='Customer_id')

df3 = pd.DataFrame(index=df1.index)
df3['status'] = (df1['Name'] == df2['Name']).map({True: 'Matching', False: 'Not matching'})

df3.to_csv('output.csv')

Edit: removed sep='\t' to use the default comma separator.
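For the multi-column part of the question, here is a minimal sketch that extends the same idea. The helper name `compare_columns` is hypothetical, and it assumes both frames are indexed by a unique customer id. It realigns the second frame to the first by id, so the two files need not be in the same order. An id missing from the second file, or a NaN in a compared column, is reported as "Not matching".

```python
import pandas as pd


def compare_columns(df1, df2, columns):
    """Return a one-column 'Status' frame indexed like df1.

    Assumes both frames are indexed by a unique customer id.
    """
    # Align the second frame to the first frame's index so rows are
    # compared by id rather than by position in the file.
    aligned = df2.reindex(df1.index)
    # A row matches only if every listed column is equal.
    same = (df1[columns] == aligned[columns]).all(axis=1)
    return same.map({True: "Matching", False: "Not matching"}).to_frame("Status")
```

With frames loaded via `pd.read_csv(..., index_col='Customer_id')`, something like `compare_columns(df1, df2, ['Name']).to_csv('output.csv')` should produce the requested file. Pass more column names in the list to compare several columns at once.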

Siliam
  • I got ValueError: Index Customer_id invalid, but changing sep='\t' to sep=',' solved the error. – pytorch Nov 05 '18 at 11:12
  • Sorry, my bad! You can actually omit the separator argument altogether if you're using comma-separated values. – Siliam Nov 06 '18 at 09:42

Read both CSV files into two separate dictionaries keyed by customer id, then iterate over one dictionary and look up each key in the other. If you need to preserve the row order, use OrderedDict.
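The dictionary approach described above could be sketched as follows. The function names `load_rows` and `compare_files` are hypothetical, and the header name `"Customer id"` is an assumption taken from the question, so adjust it to whatever the files actually use. On Python 3.7+ a plain dict already preserves insertion order, so OrderedDict is not needed here.

```python
import csv


def load_rows(path, key="Customer id"):
    """Read a CSV file into a dict mapping each key value to its full row."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}


def compare_files(path1, path2, out_path, key="Customer id", columns=("Name",)):
    """Write one status row per key in path1, comparing the given columns."""
    rows1 = load_rows(path1, key)
    rows2 = load_rows(path2, key)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([key, "Status"])
        for cust_id, row1 in rows1.items():
            row2 = rows2.get(cust_id)
            # An id missing from the second file counts as "Not matching".
            same = row2 is not None and all(row1[c] == row2[c] for c in columns)
            writer.writerow([cust_id, "Matching" if same else "Not matching"])
```

Calling `compare_files('file1.csv', 'file2.csv', 'output.csv')` compares only `Name`. To compare several columns, pass a longer `columns` tuple.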

Sanjay Idpuganti

You can merge on multiple columns:

f1
  Customer_id      Name
0          Q1      Alen
1          W2     Ricky
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo

f2
  Customer_id      Name
0          Q1      Alen
1          W2     Harry
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo

m = f1.merge(f2, on=['Customer_id', 'Name'], indicator='Status', how='outer')
  Customer_id      Name      Status
0          Q1      Alen        both
1          W2     Ricky   left_only
2          E3   Katrina        both
3          R4      Anya        both
4          T5  Leonardo        both
5          W2     Harry  right_only

m['Status'] = m['Status'].map({'both': 'Matching', 
                               'left_only': 'Not matching', 
                               'right_only': 'Not matching'})

m = m.drop_duplicates(subset=['Customer_id', 'Status'])
m = m.drop(['Name'], axis=1)
  Customer_id        Status
0          Q1      Matching
1          W2  Not matching
2          E3      Matching
3          R4      Matching
4          T5      Matching
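As a self-contained variation on the merge idea, the sketch below swaps the outer merge for a left merge. The output then has exactly one row per row of `f1`, in `f1`'s order, and no `drop_duplicates` step is needed afterwards. The function name `merge_status` is hypothetical, and the sketch assumes every column of `f1` also exists in `f2`.

```python
import pandas as pd


def merge_status(f1, f2, key="Customer_id"):
    """Label each row of f1 as matching if an identical row exists in f2."""
    # Merge on the key plus every other column of f1, so all of them
    # must agree for a row to count as a match.
    compare_cols = list(f1.columns)
    m = f1.merge(
        f2.drop_duplicates(),  # avoid duplicating f1 rows on repeated f2 rows
        on=compare_cols,
        how="left",
        indicator="Status",
    )
    # With a left merge the indicator is either 'both' or 'left_only'.
    m["Status"] = (m["Status"] == "both").map(
        {True: "Matching", False: "Not matching"}
    )
    return m[[key, "Status"]]
```

Writing the result with `merge_status(f1, f2).to_csv('output.csv', index=False)` should give the two-column file from the question.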
Alex