Counting the number of character differences between two files

Question

I have two fairly large (~20 MB) text files that are essentially long strings of integers (each one is 0, 1, or 2). I would like to write a Python script that iterates through the files and compares them integer by integer. In the end I want the number of integers that differ and the total number of integers in the files (the files should be exactly the same length). I have done some searching and it seems that difflib may be useful, but I am fairly new to Python and I am not sure whether anything in difflib will count the differences or the number of entries.

Any help would be greatly appreciated! Below is what I am trying right now, but it only looks at one entry and then terminates, and I don't understand why.

f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()

correct = 0
x = 0
total = 0
for i in fileOne:
  if i != fileTwo[x]:
    correct +=1
  x += 1
  total +=1

if total != 0:
  percent = (correct / total) * 100
  print "The file is %.1f %% correct!" % (percent)
  print "%i out of %i symbols were correct!" % (correct, total)

When you say it terminates, do you mean it prints an error message? If it does, what error message does it print? — huu, May 30 '14 at 17:31
This is because you are iterating over lines, while you should iterate over characters: http://stackoverflow.com/questions/2988211/how-to-read-a-single-character-at-a-time-from-a-file-in-python — Djizeus, May 30 '14 at 17:51
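To illustrate the point in the comment above: readlines() returns a list of lines, so a file that holds one long line yields a single element and the loop body runs once. Below is a minimal sketch of a character-by-character count. It assumes Python 3 (the question uses Python 2 print statements) and that both strings fit in memory; count_differences is a hypothetical helper name, not something from the question.

```python
def count_differences(text_one, text_two):
    """Return (differences, total) for two strings compared position by position."""
    differences = 0
    total = 0
    # zip stops at the end of the shorter string, so total is the compared length.
    for a, b in zip(text_one, text_two):
        if a != b:
            differences += 1
        total += 1
    return differences, total


# One long line becomes a one-element list, which is why looping over
# the result of readlines() only compares a single entry.
lines = "0120120\n".splitlines()
print(len(lines))  # 1

print(count_differences("0120120", "0120021"))  # (2, 7)
```

With real files, the two arguments would be the results of f1.read() and f2.read(), possibly with trailing newlines stripped first.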

dawg · Answer 1 · 2014-05-30T18:04:21.363


Not tested at all, but look at this as something a lot easier (and more Pythonic):

from itertools import izip

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    data=[(1, x==y) for x, y in izip(f1.read(), f2.read())]

print sum(1.0 for t in data if t[1]) / len(data) * 100
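The snippet above reads both files fully into memory, which is probably fine at ~20 MB. If that is a concern, a variant that reads fixed-size chunks might look like the sketch below. This is not the answer's code: it assumes Python 3, count_differences_chunked is a hypothetical name, and the default chunk size is an arbitrary choice. It returns both the number of differing characters and the number of characters compared, which is what the question asks for.

```python
import io


def count_differences_chunked(stream_one, stream_two, chunk_size=65536):
    """Return (differences, total) for two text streams, read chunk by chunk."""
    differences = 0
    total = 0
    while True:
        chunk_one = stream_one.read(chunk_size)
        chunk_two = stream_two.read(chunk_size)
        if not chunk_one and not chunk_two:
            break  # both streams are exhausted
        # zip pairs characters up to the end of the shorter chunk.
        differences += sum(1 for a, b in zip(chunk_one, chunk_two) if a != b)
        total += min(len(chunk_one), len(chunk_two))
        if len(chunk_one) != len(chunk_two):
            break  # the streams differ in length; stop at the shorter one
    return differences, total


# In-memory streams stand in for open("file1.txt") and open("file2.txt").
one = io.StringIO("0120120")
two = io.StringIO("0120021")
print(count_differences_chunked(one, two, chunk_size=3))  # (2, 7)
```

The percentage of differing characters is then differences / total * 100, guarded against total being zero.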

edited May 30 '14 at 18:04

answered May 30 '14 at 17:41


Padraic Cunningham · Answer 2 · 2014-05-30T18:43:52.753

You can use enumerate to find the characters that differ between the two files.

If all lines are guaranteed to be the same length:

with open("file1.txt","r") as f:
    l1 = f.readlines()
with open("file2.txt","r") as f:
    l2 = f.readlines()


non_matches = 0.  # number of character positions that differ
total = 0.        # number of comma-separated entries in file1
for i,j in enumerate(l1):
    non_matches += sum([1 for k,l in enumerate(j) if l2[i][k]!= l]) # add 1 for each non match
    total += len(j.split(","))
print non_matches,total*2
print non_matches / (total * 2) * 100.   # if strings are all same length just mult total by 2

6 40
15.0
