This is an addition to my previous question (Compare lines in 2 text files).
Consider these 2 example files:
A.csv:
AAA,BBB,CCC
DDD,,EEE
GGG,HHH,III
B.csv:
AAA,,BBB,CCC
EEE,,DDD,,
,,GGG,III,HHH
I want these to be treated as identical, even though the fields are in a different order and the number of columns differs.
This is what I have so far:
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))
Update:
Here is what I ended up with (thanks @keksnicoh):
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        if (seta != setb):
            print('Line {} does not match: {}'.format(a.line_num, seta ^ setb))
The issue I face now is: how to deal with duplicates, for example:
Example file A.csv:
1,2,,
1,2,2,3,4
Example file B.csv:
1,2,2,2
1,2,3,4
The script above considers the files to be identical, but they are not. From searching Stack Overflow, it seems that I cannot use a set and have to use a list instead. But then I lose the advantage of sets, which is not having to worry about the order of the fields.
How can I modify my code to consider duplicate entries as well?
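One possible direction, as a minimal sketch rather than a definitive answer: `collections.Counter` from the standard library behaves like a multiset, so it ignores field order the way a set does but still tracks how many times each value occurs. The helper names `row_counts` and `compare_rows` are made up for illustration, and the two example files with duplicates are inlined as strings so the snippet runs on its own. It assumes Python 3.

```python
import csv
import io
from collections import Counter


def row_counts(row):
    """Count each non-empty field in a row, ignoring field order."""
    return Counter(field for field in row if field)


def compare_rows(rows_a, rows_b):
    """Return (line_number, only_in_a, only_in_b) for each mismatching row pair."""
    mismatches = []
    for line_num, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
        count_a, count_b = row_counts(row_a), row_counts(row_b)
        if count_a != count_b:
            # Counter subtraction keeps only positive counts,
            # i.e. the surplus of each value on that side.
            mismatches.append((line_num, count_a - count_b, count_b - count_a))
    return mismatches


# The duplicate example from the question, inlined (hypothetical stand-in
# for opening A.csv and B.csv from sys.argv).
a_text = "1,2,,\n1,2,2,3,4\n"
b_text = "1,2,2,2\n1,2,3,4\n"

result = compare_rows(csv.reader(io.StringIO(a_text)),
                      csv.reader(io.StringIO(b_text)))
for line_num, only_a, only_b in result:
    print('Line {} does not match: only in A {}, only in B {}'.format(
        line_num, dict(only_a), dict(only_b)))
```

With these inputs both lines are reported as mismatches, because the value `2` occurs a different number of times on each side. Note that `zip` stops at the end of the shorter file, so extra trailing lines in one file would go unreported in this sketch.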