0

I have two csv files with 200 columns each. The two files have the exact same numbers in rows and columns. I want to compare each columns separately.

The idea would be to compare column 1 value of file "a" to column 1 value of file "b" and check the difference and so on for all the numbers in the column (there are 100 rows) and write out a number that in how many cases were the difference more than 3.

I would like to repeat the same for all the columns.

import pandas as pd
df=pd.read_csv('a.csv') 
de=pd.read_csv('b.csv')

2 Answers2

0

I came up with something and I hope it helps you:

# file1.csv:
#
# 1;1;1
# 3;3;3
# 5;5;5
# 7;7;7
#
# files2.csv:
#
# 2;2;2
# 4;3;4
# 6;5;6
# 8;8;8

import csv

# change this to 200 for your file
columns_num = 3

# a dictionary that will hold our columns and the number of differences
diff = {}  # {column_index: number_of_differences}

with open("file1.csv", 'r') as file1, open("file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)

    # here we go line by line
    for line1, line2 in zip(reader1, reader2):
        # your delimiter may not be the same
        line1 = line1[0].split(";")  # output : [1, 1, 1]
        line2 = line2[0].split(";")
        
        # here we go column by column
        for i in range(0, columns_num):
            if line1[i] != line2[i]:
                try:
                    # if the column exist, we increment its value
                    diff[i] += 1
                except KeyError:
                    # if the column doesn't exist, we add it
                    diff[i] = 1

print(diff)

# Output:
# {0: 4, 1: 2, 2: 4}

diff_new = {i: diff[i] for i in diff if diff[i] > 3}
print(diff_new)

# Output:
# {0: 4, 2: 4}
Aymen
  • 841
  • 1
  • 12
  • 21
0
import pandas as pd
import numpy as np

file1 = pd.read_csv("file1.csv", header=None)
file2 = pd.read_csv("file2.csv", header=None)

diff_mask = file1 != file2

# count per column where difference is more than three
diff_more_3 = np.sum(diff_mask, axis=0) > 3

# Get the number of columns where that is the case.
print(sum(diff_more_3))

diff_more_3 contains a boolean flag for each column if there are three or more differences. If you just want the number you just sum them up.

Hans Bambel
  • 806
  • 1
  • 10
  • 20