
I'm comparing two CSV files with the Python code below, but it does not report every mismatch: it shows only the first mismatch in each row. I need all the mismatches in each row.

Python code:

import csv, itertools
column_names = ['id','name','amount']
source_data = csv.reader(open('src.csv'))
target_data = csv.reader(open('tgt.csv'))
counter = 1
def rowElementCompare(sourceRow, targetRow):
    row_length = min(len(sourceRow), len(targetRow))
    for i in range(row_length):
        if sourceRow[i] != targetRow[i]:
            print i
            return i        
    return None
for source_row,target_row in itertools.izip(source_data,target_data):
    comparison_result = None
    comparison_result = rowElementCompare(source_row, target_row)
    #print (comparison_result)
    if comparison_result is not None:  # comparison_result is the column index at which the mismatch occurred
        print "Mismatch in column %s on row number %d , source value %s, target value %s" % (column_names[comparison_result], counter, source_row[comparison_result], target_row[comparison_result])
    counter += 1

File 1:

id,name,amount
1,bob,20
3,eva,8
3,sarah,7
4,jeff,19
6,fred,10

File 2:

id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13

Output of my code:

Mismatch in column amount on row number 2 , source value 20, target value 23 
Mismatch in column name on row number 3 , source value eva, target value sarah 
Mismatch in column id on row number 4 , source value 3, target value 4 
Mismatch in column id on row number 5 , source value 4, target value 5      
Mismatch in column amount on row number 6 , source value 10, target value 13

Expected output:

Mismatch in column amount on row number 2 , source value 20, target value 23
Mismatch in column name on row number 3 , source value eva, target value sarah
Mismatch in column amount on row number 3 , source value 8, target value 7
Mismatch in column id on row number 4 , source value 3, target value 4
Mismatch in column name on row number 4 , source value sarah, target value jeff
Mismatch in column amount on row number 4 , source value 7, target value 19
Mismatch in column id on row number 5 , source value 4, target value 5
Mismatch in column name on row number 5 , source value jeff, target value mira
Mismatch in column amount on row number 5 , source value 19, target value 81
...
jonrsharpe
Gokul Krishna

1 Answer

The problem is that you're calling rowElementCompare just once per row. Furthermore, calling it repeatedly wouldn't help since it always starts at the beginning of the row and stops once it finds the first mismatch.

One way to fix this is to change rowElementCompare to yield its result rather than returning it. This way you can loop over all the mismatches in that row.
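To see the difference in isolation, here is a minimal sketch on a single pair of rows. It is written in Python 3 syntax (an assumption; the code in this thread is Python 2), and the names first_mismatch and all_mismatches are made up for the illustration.

```python
def first_mismatch(source_row, target_row):
    # Original behaviour: return leaves the function at the first difference.
    for i in range(min(len(source_row), len(target_row))):
        if source_row[i] != target_row[i]:
            return i
    return None


def all_mismatches(source_row, target_row):
    # Generator version: yield hands back one index, then the loop resumes.
    for i in range(min(len(source_row), len(target_row))):
        if source_row[i] != target_row[i]:
            yield i


# Row 4 of the two sample files from the question.
source_row = ['3', 'sarah', '7']
target_row = ['4', 'jeff', '19']

print(first_mismatch(source_row, target_row))        # 0
print(list(all_mismatches(source_row, target_row)))  # [0, 1, 2]
```

Because all_mismatches is a generator, the caller can iterate over it with a plain for loop and get one index per differing column.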

Here is the updated code. Changed lines are commented with # UPDATED.

import csv, itertools
column_names = ['id','name','amount']
source_data = csv.reader(open('src.csv'))
target_data = csv.reader(open('tgt.csv'))
counter = 1
def rowElementCompare(sourceRow, targetRow):
    row_length = min(len(sourceRow), len(targetRow))
    for i in range(row_length):
        if sourceRow[i] != targetRow[i]:
            yield i # UPDATED
    return # UPDATED
for source_row,target_row in itertools.izip(source_data,target_data):
    comparison_result = None
    for comparison_result in rowElementCompare(source_row, target_row): # UPDATED
        print "Mismatch in column %s on row number %d , source value %s, target value %s" % (column_names[comparison_result], counter, source_row[comparison_result], target_row[comparison_result])
    counter += 1
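If you are on Python 3, the code above won't run as written: itertools.izip no longer exists and print is a function. The following is a rough port of the same generator approach, not a drop-in replacement. It assumes in-memory strings standing in for the first three lines of src.csv and tgt.csv, and the compare wrapper is a hypothetical helper that collects the messages instead of printing them directly.

```python
import csv
import io

column_names = ['id', 'name', 'amount']

# In-memory stand-ins for the first three lines of src.csv and tgt.csv.
SRC = "id,name,amount\n1,bob,20\n3,eva,8\n"
TGT = "id,name,amount\n1,bob,23\n3,sarah,7\n"


def rowElementCompare(sourceRow, targetRow):
    # Yield the index of every column whose values differ.
    row_length = min(len(sourceRow), len(targetRow))
    for i in range(row_length):
        if sourceRow[i] != targetRow[i]:
            yield i


def compare(source_file, target_file):
    # Hypothetical wrapper: returns the messages as a list.
    messages = []
    counter = 1
    # zip is lazy in Python 3, so it takes the place of itertools.izip.
    rows = zip(csv.reader(source_file), csv.reader(target_file))
    for source_row, target_row in rows:
        for col in rowElementCompare(source_row, target_row):
            messages.append(
                "Mismatch in column %s on row number %d , source value %s, target value %s"
                % (column_names[col], counter, source_row[col], target_row[col]))
        counter += 1
    return messages


for line in compare(io.StringIO(SRC), io.StringIO(TGT)):
    print(line)
```

With real files you would pass open('src.csv', newline='') and open('tgt.csv', newline='') to compare instead of the StringIO objects.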

Another small suggestion for cleaning up the code: you can use enumerate to avoid having to update a counter variable manually.

for counter,(source_row,target_row) in enumerate(itertools.izip(source_data,target_data), start=1):
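For completeness, here is a small self-contained sketch of that enumerate loop with a body. It uses Python 3 syntax (zip rather than itertools.izip, an assumption about your interpreter), plain lists standing in for the two csv readers, and a made-up differing_rows list just to have something to collect.

```python
# Plain lists standing in for the rows produced by the two csv readers.
source_rows = [['id', 'name'], ['1', 'bob'], ['3', 'eva']]
target_rows = [['id', 'name'], ['1', 'bob'], ['3', 'sarah']]

differing_rows = []
# start=1 makes counter a 1-based row number, like the manual counter did.
for counter, (source_row, target_row) in enumerate(zip(source_rows, target_rows), start=1):
    if source_row != target_row:
        differing_rows.append(counter)

print(differing_rows)  # [3]
```

The parentheses around (source_row, target_row) are needed because enumerate yields pairs of the form (index, item), and each item here is itself a pair of rows.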
Uri Granta