
I have the Python code below, which compares the rows of two CSV files field by field and displays the differences. However, the output is not in order. Please help me improve the output.

(I googled and found a Python package, csvdiff, but it requires specifying a column number.)

2 CSV files:

cat file1.csv
1,2,2222,3333,4444,3,

cat file2.csv
1,2,5555,6666,7777,3,

My Python3 code:

with open('file1.csv', 'r') as t1, open('file2.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

for i in range(0,len(filevalidation)):
    coming_set = set(filecoming[i].replace("\n","").split(","))
    validation_set = set(filevalidation[i].replace("\n","").split(","))
    ReceivedDataList=list(validation_set.intersection(coming_set))
    NotReceivedDataList=list(coming_set.union(validation_set)- 
    coming_set.intersection(validation_set))
    print(NotReceivedDataList)

output:

['6666', '5555', '3333', '2222', '4444', '7777']

Even though it prints the differences from both files, the output is not in order (3 differences from file2 and 3 differences from file1).

I am trying to produce column-wise results, i.e., each difference in file1 paired with the corresponding difference in file2.

Something like:

2222  - 5555
3333  - 6666
4444  - 7777

Please help.

Thanks in advance.

Sheldon
  • did you check https://stackoverflow.com/questions/38996033/python-compare-two-csv-files-and-print-out-differences – Mohsen Mar 30 '20 at 19:22

1 Answer


Try this:

# Read both files fully; no third-party packages are needed for this.
with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

for i in range(0,len(filevalidation)):
    coming_set = set(filecoming[i].replace("\n","").split(","))
    validation_set = set(filevalidation[i].replace("\n","").split(","))
    ReceivedDataList=list(validation_set.intersection(coming_set))
    NotReceivedDataList=list(coming_set.union(validation_set)-coming_set.intersection(validation_set))
    print(NotReceivedDataList)

old=[]
new=[]
for items in NotReceivedDataList:
    if items in coming_set:
        old.append(items)

    elif items in validation_set:
        new.append(items)
print(old)
print(new)

Output:

['2222', '5555', '6666', '3333', '4444', '7777']
['2222', '3333', '4444']
['5555', '6666', '7777']

Addition: this may help you more. Suppose old and new hold the values of one row from each CSV file. Then [item for item in old if item not in new] gives the items that are not in new. Also, with the help of enumerate we can identify which columns differ (here the differences are at indices 2, 3 and 4).

old=[1,2,2222,3333,4444,3]
new=[1,2,5555,6666,7777,3]

print([item for item in old if item not in new])
print([item for item in new if item not in old])

for index, (first, second) in enumerate(zip(old, new)):
    if first != second:
        print(index, first, second)

Output:

[2222, 3333, 4444]
[5555, 6666, 7777]
2 2222 5555
3 3333 6666
4 4444 7777
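
The enumerate/zip loop above works on hard-coded lists. A minimal sketch of the same idea applied to the files themselves could look like the following. It assumes both files have rows in the same order, and the helper names diff_rows and diff_csv_files are made up for illustration; they are not part of any library.

```python
import csv
from itertools import zip_longest


def diff_rows(old_row, new_row):
    """Return (column_index, old_value, new_value) for each column that differs."""
    return [
        (index, old, new)
        # zip_longest pads the shorter row with "" so extra columns are reported too.
        for index, (old, new) in enumerate(zip_longest(old_row, new_row, fillvalue=""))
        if old != new
    ]


def diff_csv_files(old_path, new_path):
    """Yield (row_number, column_index, old_value, new_value) for every differing cell."""
    with open(old_path, newline="") as old_file, open(new_path, newline="") as new_file:
        # Pair rows by position; a file with fewer rows is padded with empty rows.
        rows = zip_longest(csv.reader(old_file), csv.reader(new_file), fillvalue=[])
        for row_number, (old_row, new_row) in enumerate(rows):
            for index, old, new in diff_rows(old_row, new_row):
                yield row_number, index, old, new


# The two rows from the question, as csv.reader would return them (all strings,
# with a trailing empty field because each line ends in a comma).
for index, old, new in diff_rows(
    ["1", "2", "2222", "3333", "4444", "3", ""],
    ["1", "2", "5555", "6666", "7777", "3", ""],
):
    print(f"{old}  - {new}")
```

To run it against the real files, iterate over diff_csv_files('file1.csv', 'file2.csv') and print each tuple. Because values are compared by position rather than through sets, the pairs come out in column order.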
Mohsen
  • thanks a lot Mohsen.. Second option is the one that suits my requirement. – Sheldon Mar 31 '20 at 13:27
  • Aside from the very last loop with `enumerate`, most of this answer inherits bugs from the question's code, such as not actually checking differences by column - the difference between `2,1` and `1,2` would not be detected. – user2357112 Apr 01 '20 at 10:53
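
The case raised in the last comment can be checked with a short sketch (the variable names are chosen here for illustration): a set-based comparison of the rows 2,1 and 1,2 reports nothing, while a position-based comparison reports both columns.

```python
old_row = ["2", "1"]
new_row = ["1", "2"]

# Set-based comparison: position is discarded, so the swap goes unnoticed.
# The symmetric difference (^) is the same as union minus intersection.
set_diff = set(old_row) ^ set(new_row)
print(set_diff)  # set()

# Position-based comparison: both swapped columns are reported.
column_diff = [
    (index, old, new)
    for index, (old, new) in enumerate(zip(old_row, new_row))
    if old != new
]
print(column_diff)  # [(0, '2', '1'), (1, '1', '2')]
```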