
This is an addition to my previous question (Compare lines in 2 text files).

Consider these 2 example files:

A.csv:

AAA,BBB,CCC  
DDD,,EEE  
GGG,HHH,III

B.csv:

AAA,,BBB,CCC  
EEE,,DDD,,  
,,GGG,III,HHH

I want these two files to be treated as identical, even though the fields are in a different order and the rows have a different number of columns.

This is what I have so far:

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        if sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))

Update:

Here is what I ended up with (thanks @keksnicoh):

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        if (seta != setb):
            print('Line {} does not match: {}'.format(a.line_num, seta ^ setb))

The issue I face now is: how to deal with duplicates, for example:

Example file A.csv:

1,2,,
1,2,2,3,4

Example file B.csv:

1,2,2,2
1,2,3,4

The script above considers the files to be identical, but they are not. From searching Stack Overflow, it seems that I cannot use a set but have to use a list. But then I lose the advantage of using sets, which is not having to worry about the order of the fields.

How can I modify my code to consider duplicate entries as well?

1 Answer


You could convert each line to a set, filtering out the empty strings. Then compute the symmetric difference of the two sets and check whether the resulting set is empty.

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        seta = set([x for x in linea if len(x) > 0])
        setb = set([x for x in lineb if len(x) > 0])
        print(len(seta^setb)==0)

You can also write this more compactly:

for seta in (set([x for x in l if len(x) > 0]) for l in a):
    setb = set([x for x in next(b) if len(x) > 0])
    print(len(seta^setb)==0)

UPDATE

To keep things simple, one can of course just check for equality:

seta==setb

Sorry for the confusion.
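For the duplicate case raised in the question's update, one possible approach (not part of the original answer) is to compare multisets instead of sets. The sketch below assumes `collections.Counter` from the standard library is acceptable; the helper names `row_counter` and `compare_rows` are made up for illustration, and the example data is read from in-memory strings rather than from `sys.argv` files.

```python
import csv
import io
from collections import Counter


def row_counter(row):
    """Count the non-empty fields of a row, ignoring their order."""
    return Counter(field for field in row if field)


def compare_rows(rows_a, rows_b):
    """Return (line_number, only_in_a, only_in_b) for each mismatching row pair."""
    mismatches = []
    for line_num, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
        count_a, count_b = row_counter(row_a), row_counter(row_b)
        if count_a != count_b:
            # Counter subtraction keeps only the positive counts
            mismatches.append((line_num, count_a - count_b, count_b - count_a))
    return mismatches


# The duplicate example from the question, held in memory instead of files
a_text = "1,2,,\n1,2,2,3,4\n"
b_text = "1,2,2,2\n1,2,3,4\n"

result = compare_rows(csv.reader(io.StringIO(a_text)),
                      csv.reader(io.StringIO(b_text)))
for line_num, only_a, only_b in result:
    print('Line {} does not match: only in A {}, only in B {}'.format(
        line_num, dict(only_a), dict(only_b)))
```

A `Counter` still ignores field order, like a set, but it also tracks how often each value occurs. Note that `zip` stops at the shorter input, so files with a different number of lines would need an extra check.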

Nicolas Heimann
  • Thanks for your code, it works, but I am not quite sure how it works. In the upper code, `seta` and `setb` contain all fields that are non-empty, correct? Just as one long continuous string without separators? But how does `print(len(seta^setb)==0)` work? – Markus Heller Jan 12 '16 at 22:42
  • seta and setb are sets, which means they can only contain distinct elements. The symmetric difference (the ^ operator here) creates a third set containing only the elements **that are in exactly one of seta and setb**. It is like an "xor" operation. Thus, if seta^setb is empty (length == 0), the two sets share all their elements, so they are equal. Note that sets do not care about order! You may be interested in this article: [Wikipedia on symmetric difference](https://en.wikipedia.org/wiki/Symmetric_difference) – Nicolas Heimann Jan 12 '16 at 22:52
  • I just realized something... Sorry for the confusion; of course you could check seta and setb for equality instead of using the symmetric difference. Sometimes one thinks of the complex solution first and the easy one later ;) But just as a hint, the symmetric difference is a quite useful operation (for example, to see which elements differ). – Nicolas Heimann Jan 12 '16 at 22:59
  • The words in parentheses are intriguing: how could I see which elements are different? I'm not very fluent in python. – Markus Heller Jan 12 '16 at 23:01
  • `[x for x in linea if len(x) > 0]` creates a list containing only the entries with a length greater than 0. Read it as "put x into a new list for each x in linea, as long as it has a length greater than 0". – Nicolas Heimann Jan 12 '16 at 23:09
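
As a follow-up to the question in the comments about seeing which elements differ, here is a minimal sketch using plain set operations. The row values are adapted from the question's example files; the extra field 'XXX' and the helper name `non_empty` are made up for illustration.

```python
def non_empty(row):
    """Build a set from the non-empty fields of a row."""
    return {field for field in row if field}


row_a = ['DDD', '', 'EEE']
row_b = ['EEE', '', 'DDD', 'XXX', '']  # 'XXX' is a made-up extra field

seta = non_empty(row_a)
setb = non_empty(row_b)

only_in_a = seta - setb   # fields that appear only in row_a
only_in_b = setb - seta   # fields that appear only in row_b
differing = seta ^ setb   # symmetric difference: the union of the two above

print(sorted(only_in_a))  # []
print(sorted(only_in_b))  # ['XXX']
print(sorted(differing))  # ['XXX']
```

The symmetric difference alone says which fields differ; the two one-sided differences additionally say which file each differing field came from.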