How could I compare the results of two csv files that only contains numbers?

Question

I have two csv files with 200 columns each. The two files have the exact same numbers in rows and columns. I want to compare each columns separately.

The idea would be to compare column 1 value of file "a" to column 1 value of file "b" and check the difference and so on for all the numbers in the column (there are 100 rows) and write out a number that in how many cases were the difference more than 3.

I would like to repeat the same for all the columns.

import pandas as pd
df=pd.read_csv('a.csv') 
de=pd.read_csv('b.csv')

Does this help https://stackoverflow.com/questions/4700441/python-comparing-specific-columns-in-two-csv-files — Aymen, Nov 16 '22 at 11:06

score 0 · Answer 1 · answered Nov 16 '22 at 15:47

I came up with something and I hope it helps you:

# file1.csv:
#
# 1;1;1
# 3;3;3
# 5;5;5
# 7;7;7
#
# files2.csv:
#
# 2;2;2
# 4;3;4
# 6;5;6
# 8;8;8

import csv

# change this to 200 for your file
columns_num = 3

# a dictionary that will hold our columns and the number of differences
diff = {}  # {column_index: number_of_differences}

with open("file1.csv", 'r') as file1, open("file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)

    # here we go line by line
    for line1, line2 in zip(reader1, reader2):
        # your delimiter may not be the same
        line1 = line1[0].split(";")  # output : [1, 1, 1]
        line2 = line2[0].split(";")
        
        # here we go column by column
        for i in range(0, columns_num):
            if line1[i] != line2[i]:
                try:
                    # if the column exist, we increment its value
                    diff[i] += 1
                except KeyError:
                    # if the column doesn't exist, we add it
                    diff[i] = 1

print(diff)

# Output:
# {0: 4, 1: 2, 2: 4}

diff_new = {i: diff[i] for i in diff if diff[i] > 3}
print(diff_new)

# Output:
# {0: 4, 2: 4}

score 0 · Answer 2 · answered Nov 16 '22 at 16:00

import pandas as pd
import numpy as np

file1 = pd.read_csv("file1.csv", header=None)
file2 = pd.read_csv("file2.csv", header=None)

diff_mask = file1 != file2

# count per column where difference is more than three
diff_more_3 = np.sum(diff_mask, axis=0) > 3

# Get the number of columns where that is the case.
print(sum(diff_more_3))

diff_more_3 contains a boolean flag for each column if there are three or more differences. If you just want the number you just sum them up.

How could I compare the results of two csv files that only contains numbers?

2 Answers2