1

I have two large text files (200,000+ lines) in CSV format. I need to compare them line by line, but the fields may be switched within each line.

Example file A.csv:

AAA,BBB,,DDD  
EEE,,GGG,HHH  
III,JJJ,KKK,LLL

Example file B.csv:

AAA,,BBB,DDD  
EEE,,GGG,HHH  
LLL,KKK,JJJ,III

So for my purposes, A.csv and B.csv should be "identical" even though the fields are switched in the first and last lines. Since the fields in each file might be in a different order, the usual options like grep or diff won't work.

Basically, I think I need to write something that reads a line of A.csv and B.csv, and checks if all fields are present in both lines, independent of the order. Alternatively, something that orders the fields after reading the lines.

Remi Guan

2 Answers

6

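You can normalize each line for the comparison only, without changing the data itself: sort the fields of both lines (lowercased here, so the match is also case-insensitive) and compare the sorted lists.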

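import csv
from itertools import zip_longest

with open('big1.csv', newline='') as i, open('big2.csv', newline='') as j:
    a = csv.reader(i)
    b = csv.reader(j)
    # zip_longest pads the shorter file with None, so extra lines are reported too
    for linea, lineb in zip_longest(a, b):
        if linea is None or lineb is None:
            print('{} does not match {}'.format(linea, lineb))
        # compare the fields case-insensitively, in sorted order
        elif sorted(map(str.lower, linea)) != sorted(map(str.lower, lineb)):
            print('{} does not match {}'.format(linea, lineb))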
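
A variant of the same idea, as a minimal sketch rather than a definitive implementation. The helper names `same_fields` and `mismatched_lines` are made up for illustration, and `collections.Counter` is swapped in for sorting, so each line is compared as a multiset of fields. It assumes fields should be compared case-sensitively after stripping surrounding whitespace, which may not fit your data.

```python
import csv
from collections import Counter
from itertools import zip_longest


def same_fields(row_a, row_b):
    """Return True if both rows hold the same fields, ignoring their order."""
    # Counter keeps duplicates, so ['A', 'A', 'B'] differs from ['A', 'B', 'B'].
    return Counter(f.strip() for f in row_a) == Counter(f.strip() for f in row_b)


def mismatched_lines(lines_a, lines_b):
    """Yield (line_number, row_a, row_b) for each pair of rows that differ."""
    rows = zip_longest(csv.reader(lines_a), csv.reader(lines_b))
    for number, (row_a, row_b) in enumerate(rows, start=1):
        # A row of None means one input ran out of lines before the other.
        if row_a is None or row_b is None or not same_fields(row_a, row_b):
            yield number, row_a, row_b


# The example lines from the question, used here instead of real files.
a = ["AAA,BBB,,DDD", "EEE,,GGG,HHH", "III,JJJ,KKK,LLL"]
b = ["AAA,,BBB,DDD", "EEE,,GGG,HHH", "LLL,KKK,JJJ,III"]
print(list(mismatched_lines(a, b)))  # prints []
```

For the real files, pass two file objects opened with `newline=''` instead of the lists; the generator reads them one line at a time, so 200,000+ lines should not need to fit in memory at once.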
Burhan Khalid
-1

Try using diff from the Linux/Unix command line; it is very useful for comparing files.
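
Plain diff compares lines literally, so it would flag the reordered fields from the question. One way to make a diff-style comparison usable is to sort the fields of each line first. The sketch below does that with Python's `difflib` instead of the command-line tool; the `normalize` helper is hypothetical, it assumes no field contains a quoted comma, and the second line of `b` is deliberately changed to `XXX` so there is a difference to show.

```python
import difflib


def normalize(lines):
    """Sort the comma-separated fields of each line so field order no longer matters."""
    # Assumption: fields contain no quoted commas, so a plain split is enough.
    return [",".join(sorted(line.strip().split(","))) for line in lines]


a = ["AAA,BBB,,DDD", "EEE,,GGG,HHH", "III,JJJ,KKK,LLL"]
b = ["AAA,,BBB,DDD", "EEE,,GGG,XXX", "LLL,KKK,JJJ,III"]  # XXX is an injected difference

# unified_diff reports only the lines that still differ after normalization.
diff = list(difflib.unified_diff(normalize(a), normalize(b), "A.csv", "B.csv", lineterm=""))
print("\n".join(diff))
```

The same normalization could be written to two temporary files and passed to the diff command itself. On files with 200,000+ lines, `difflib` may be noticeably slower than a line-by-line comparison.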

Falko
Tim Seed