How to compare 2 CSV files in Python value by value and print the difference?

Question

I have 2 CSV files with the same dimensions. In the example below the dimensions are 3*3 (3 comma-separated values and 3 rows), but the files could be as large as 100*10000.

File1.csv:

Name, ID, Profession

Tom, 1, Teacher

Dick, 2, Actor

File2.csv:

Name, ID, Profession

Dick, 2, Actor

Tom, 1, Police

I want to compare the files element-wise (e.g. Teacher == Police).

It would be great if I could compare the rows using a primary key (ID), in case the rows are not in the same order. I would like output something like this:

Profession of ID = 1 does not match, i.e Teacher <> Police

ID in the output above is the primary key.
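
One way to get that output is sketched below, using `csv.DictReader` from the standard library to index each file by its key column. The helper names `load_by_key` and `diff_by_key` are made up for illustration, and the sketch assumes every ID is unique within a file and that both files share the same header. Keys that exist in only one file are silently skipped here.

```python
import csv

def load_by_key(path, key):
    """Map each row's key value to the row (a dict of column name -> value)."""
    with open(path, newline="") as f:
        # skipinitialspace drops the blank after each comma in "Tom, 1, Teacher"
        reader = csv.DictReader(f, skipinitialspace=True)
        return {row[key]: row for row in reader}

def diff_by_key(path_a, path_b, key="ID"):
    """Return one message per differing cell among keys present in both files."""
    rows_a = load_by_key(path_a, key)
    rows_b = load_by_key(path_b, key)
    messages = []
    for key_value, row_a in rows_a.items():
        row_b = rows_b.get(key_value)
        if row_b is None:
            continue  # key only in the first file; not reported in this sketch
        for column, value_a in row_a.items():
            if column != key and value_a != row_b.get(column):
                messages.append(
                    f"{column} of {key} = {key_value} does not match, "
                    f"i.e {value_a} <> {row_b.get(column)}"
                )
    return messages
```

With the two sample files above, `diff_by_key('File1.csv', 'File2.csv')` should return the single message `Profession of ID = 1 does not match, i.e Teacher <> Police`, regardless of row order.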

Note: the files may be very large (100 columns * 10000 records).

Below is the code I used to get the lists A and B from the 2 CSV files. It's very tedious, though, and I could get only 2 lines using such long code.

source_file = open('File1.csv', 'r')
file_one_line_1 = source_file.readline()
file_one_line_1_str = str(file_one_line_1)
file_one_line_1_str_replace = file_one_line_1_str.replace('\n', '')
file_one_line_1_list = list(file_one_line_1_str_replace.split(','))
file_one_line_2 = source_file.readline()
file_one_line_2_str = str(file_one_line_2)
file_one_line_2_str_replace = file_one_line_2_str.replace('\n', '')
file_one_line_2_list = list(file_one_line_2_str_replace.split(','))
file_one_line_3 = source_file.readline()
file_one_line_3_str = str(file_one_line_3)
file_one_line_3_str_replace = file_one_line_3_str.replace('\n', '')
file_one_line_3_list = list(file_one_line_3_str_replace.split(','))
A = [file_one_line_1_list, file_one_line_2_list, file_one_line_3_list]


target_file = open('File2.csv', 'r')
file_two_line_1 = target_file.readline()
file_two_line_1_str = str(file_two_line_1)
file_two_line_1_str_replace = file_two_line_1_str.replace('\n', '')
file_two_line_1_list = list(file_two_line_1_str_replace.split(','))
file_two_line_2 = target_file.readline()
file_two_line_2_str = str(file_two_line_2)
file_two_line_2_str_replace = file_two_line_2_str.replace('\n', '')
file_two_line_2_list = list(file_two_line_2_str_replace.split(','))
file_two_line_3 = target_file.readline()
file_two_line_3_str = str(file_two_line_3)
file_two_line_3_str_replace = file_two_line_3_str.replace('\n', '')
file_two_line_3_list = list(file_two_line_3_str_replace.split(','))
B = [file_two_line_1_list, file_two_line_2_list, file_two_line_3_list]
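
The per-line steps above (read a line, drop the newline, split on commas) repeat identically for every line, so they can be folded into a loop. Below is a minimal sketch under some assumptions: `read_rows` is a hypothetical helper name, no field contains a quoted comma (the `csv` module handles that case), and, unlike the code above, each cell is also stripped of surrounding spaces.

```python
def read_rows(path):
    """Return the file's non-blank lines, each split on commas into stripped cells."""
    rows = []
    with open(path, "r") as f:  # the with-block closes the file automatically
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            rows.append([cell.strip() for cell in line.split(",")])
    return rows

# Hypothetical usage with the file names from the question:
# A = read_rows("File1.csv")
# B = read_rows("File2.csv")
```

Because the loop runs until the file ends, the same function works for 3 rows or 10000 rows without any extra variables.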

I used the code below and it works smoothly:


import csv

source_file = 'Book1.csv'

target_file = 'Book2.csv'

primary_key = 'id'

# read source and target files
with open(source_file, 'r') as f:
    reader = csv.reader(f)
    A = list(reader)
with open(target_file, 'r') as f:
    reader = csv.reader(f)
    B = list(reader)

# get the number of the 'ID' column
column_names = A[0]
column_id = column_names.index(primary_key)

# get the column names without 'ID'
values_name = column_names[0:column_id] + column_names[column_id + 1:]

# create a dictionary with keys in column `column_id`
# and values the list of the other column values
A_dict = {a[column_id]: a[0:column_id] + a[column_id + 1:] for a in A}
B_dict = {b[column_id]: b[0:column_id] + b[column_id + 1:] for b in B}

# iterate on the keys and on the other columns and print the differences
for id in A_dict.keys():
    for column in range(len(column_names) - 1):
        if A_dict[id][column] != B_dict[id][column]:
            print(f"{primary_key} = {id}\t{values_name[column]}: {A_dict[id][column]} != {B_dict[id][column]}")
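
The loop above raises a `KeyError` if a key from the first file is absent from the second, and it also stores the header row in both dictionaries. A possible variant that avoids both is sketched here. `compare_tables` is a hypothetical name, the function takes the `A` and `B` lists built above, and it assumes both files list their columns in the same order. It returns the messages rather than printing them.

```python
def compare_tables(A, B, primary_key):
    """Compare two tables (lists of rows, header first) by a key column."""
    header = A[0]
    key_index = header.index(primary_key)
    # Skip the header row (index 0) when building the lookup tables
    a_rows = {row[key_index]: row for row in A[1:]}
    b_rows = {row[key_index]: row for row in B[1:]}

    messages = []
    for key, a_row in a_rows.items():
        b_row = b_rows.get(key)
        if b_row is None:
            messages.append(f"{primary_key} = {key}\tmissing from second file")
            continue
        # Pair each column name with the two values found for this key
        for name, a_value, b_value in zip(header, a_row, b_row):
            if name != primary_key and a_value != b_value:
                messages.append(f"{primary_key} = {key}\t{name}: {a_value} != {b_value}")
    # Keys that appear only in the second table
    for key in sorted(b_rows.keys() - a_rows.keys()):
        messages.append(f"{primary_key} = {key}\tmissing from first file")
    return messages
```

Calling `print('\n'.join(compare_tables(A, B, primary_key)))` would then reproduce the printed report, with extra lines for any unmatched keys.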

Thanks.

Raphael · Answer 1 · 2020-01-09T13:32:47.670


To read a CSV file and store its contents as nested lists, see https://stackoverflow.com/a/35340988/12669658

For comparing the lists element-wise, refer to your dedicated question: https://stackoverflow.com/a/59633822/12669658
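
As a rough illustration of what an element-wise comparison of two nested lists can look like (this is a minimal sketch, not the code from the linked answers, and `elementwise_diff` is a made-up name), rows and cells can be paired by position with `zip`. It assumes both lists are already in the same row order.

```python
def elementwise_diff(A, B):
    """Yield (row_index, column_index, value_a, value_b) for each differing cell.

    Rows and cells are paired by position; zip stops at the shorter input,
    so extra rows or cells on one side are ignored.
    """
    for i, (row_a, row_b) in enumerate(zip(A, B)):
        for j, (value_a, value_b) in enumerate(zip(row_a, row_b)):
            if value_a != value_b:
                yield i, j, value_a, value_b
```

Since the function is a generator, it does not build a full list of differences in memory; wrap the call in `list(...)` if you need them all at once.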

edited Jan 09 '20 at 13:32

answered Jan 08 '20 at 12:16


The 1st code returns an error: line 8, in A.append(next_line_list) MemoryError. The 2nd code returns an error: line 5, in for next_line in readlines(source_file): NameError: name 'readlines' is not defined – Thoufeeque Jan 08 '20 at 19:10
Yeah, my bad. I was in a hurry and posted untested code. I hoped to help you see how to fix and improve yours, but I removed it since I lack the time to do it properly – Raphael Jan 09 '20 at 13:36
