
I have two CSV files with 200 columns each. Both files have the same number of rows and columns. I want to compare them column by column.

The idea is to compare each value in column 1 of file "a" with the corresponding value in column 1 of file "b", compute the difference for every row (there are 100 rows), and write out the number of cases in which the difference was more than 3.

I would like to repeat the same for all the columns. I know it should be a double for loop, but I have no idea how to write it.

Thanks in advance!

import pandas as pd
dk = pd.read_csv('C:/Users/D/1_top_a.csv', sep=',', header=None)
dk = dk.dropna(how='all')
dk = dk.dropna(how='all', axis=1)
print(dk)

dl = pd.read_csv('C:/Users/D/1_top_b.csv', sep=',', header=None)
dl = dl.dropna(how='all')
dl = dl.dropna(how='all', axis=1)
print(dl)

rows = dk.shape[0]
print(rows)
# Unfinished attempt: walk down the rows of the first column
for i in range(rows):
    print(dk.iat[i, 0])
  • Look at [DataFrame.compare](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html); see also [this answer](https://stackoverflow.com/questions/48647534/python-pandas-find-difference-between-two-data-frames) – Joshua Voskamp Nov 16 '22 at 15:22
  • please read the question, none of them is helpful – helloitsmontie Nov 16 '22 at 15:30
  • Could you provide some example input and expected output? You say the solution "should be a double `for` loop" -- with `pandas` for-loops are rarely the best option. df.compare is almost certainly the key building block in the solution. – Joshua Voskamp Nov 16 '22 at 15:33
  • When you say "write out a number that in how many cases were the difference more than 3": if you compare `[1, 2, 3, 6, 7, 8]` with `[0, 0, 0, 0, 5, 5]`, what do you expect as output? `1` (because only `6-0` is more than 3 away from the element at the same position in the other list)? Or `5` (the count of all the values that were different, if more than 3 distinct values are different)? Or something else? – Joshua Voskamp Nov 16 '22 at 15:36
  • exactly, I want one number as output for every compared column. – helloitsmontie Nov 16 '22 at 15:37
  • But I also want to examine other things, that's why I don't want to use a specific command but a for loop in which I can easily change the mathematical operation – helloitsmontie Nov 16 '22 at 15:37
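
The loop described in the question and comments could be sketched roughly as follows. This is only an illustration, not the asker's code: the two small frames stand in for the real CSV files, `count_large_diffs` is a hypothetical helper name, and "difference more than 3" is assumed to mean an absolute difference greater than 3. The helper is the part to swap out when a different mathematical operation is needed.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (3 rows, 2 columns here).
a = pd.DataFrame([[1, 10], [5, 20], [9, 30]])
b = pd.DataFrame([[2, 12], [9, 20], [1, 26]])


def count_large_diffs(col_a, col_b, threshold=3):
    """Count rows where the absolute difference exceeds the threshold.

    Assumption: "difference" means absolute difference.
    Replace the body to examine something else.
    """
    return int(((col_a - col_b).abs() > threshold).sum())


# One pass over the column labels; each column yields a single number.
counts = {}
for col in a.columns:
    counts[col] = count_large_diffs(a[col], b[col])

print(counts)  # {0: 2, 1: 1}
```

Only one explicit loop is needed, because the subtraction inside the helper already works on a whole column at once.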

1 Answer

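import pandas as pd

df1 = pd.DataFrame(dict(cola=[1,2,3,4], colb=[4,5,6,7]))
df2 = pd.DataFrame(dict(cola=[1,2,4,5], colb=[9,7,8,9]))

# Compare the two frames one column at a time
for label in df1.columns:
    # Series.compare keeps only the rows where the two columns differ
    diff = df1[label].compare(df2[label])
    # Report columns with at least 3 differing rows
    if diff.shape[0] >= 3:
        print(f'Found {diff.shape[0]} diffs in {label}')
        # to_markdown() needs the optional tabulate package
        print(diff.to_markdown())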

Out:

Found 4 diffs in colb
|    |   self |   other |
|---:|-------:|--------:|
|  0 |      4 |       9 |
|  1 |      5 |       7 |
|  2 |      6 |       8 |
|  3 |      7 |       9 |
crashMOGWAI
  • Can't decide whether to upvote (provides reproducible example, which OP didn't; provides OP-imagined solution) or downvote (iterates over columns with a for-loop; pandas usually has a better way). Good effort – Joshua Voskamp Nov 16 '22 at 15:41
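
As a rough illustration of the loop-free approach hinted at in the comment above, the per-column counts can be computed with whole-frame arithmetic. This is a sketch under assumptions: it reuses the answer's example frames, treats "difference more than 3" as an absolute difference greater than 3, and relies on both frames sharing the same index and column labels. The names `exceeds` and `counts` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(cola=[1, 2, 3, 4], colb=[4, 5, 6, 7]))
df2 = pd.DataFrame(dict(cola=[1, 2, 4, 5], colb=[9, 7, 8, 9]))

# Element-wise absolute difference, then a boolean mask of large differences.
exceeds = (df1 - df2).abs() > 3

# Summing a boolean frame counts the True values in each column.
counts = exceeds.sum()
print(counts.to_dict())  # {'cola': 0, 'colb': 1}
```

Note that this counts rows whose values differ by more than 3, whereas the answer's code counts rows that differ at all, so the two give different numbers for the same data.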