4

I have been using Python's difflib library to find where two documents differ. The Differ().compare() method does this, but it is very slow: at least 100x slower than the diff command for large HTML documents.

How can I efficiently determine where two documents differ in Python? (Ideally I am after the positions of the differences rather than the actual text, in the way SequenceMatcher().get_opcodes() returns them.)

hoju

3 Answers

3
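from itertools import zip_longest

# Read in binary mode so len() of each line is a true byte count.
with open("file1.txt", "rb") as fa, open("file2.txt", "rb") as fb:
    a = fa.readlines()
    b = fb.readlines()

pos = 0  # byte offset of the current line within file1.txt

# zip_longest pads the shorter file with None, so extra trailing lines
# in either file are reported as differences too.
for count, (al, bl) in enumerate(zip_longest(a, b), start=1):
    if al != bl:
        print("files differ on line %d, byte %d" % (count, pos))
    pos += len(al or b"")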
Kimvais
  • That is a good idea to compare by lines instead of characters, which is what I was doing. When I changed Differ to use lines instead of characters the efficiency became comparable to the diff command! – hoju Jan 05 '10 at 09:46
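
The line-based approach mentioned in the comment above can also be combined with SequenceMatcher().get_opcodes() from the question, which reports positions rather than text. The following is only a minimal sketch: the helper name changed_line_ranges and the sample documents are made up for illustration.

```python
from difflib import SequenceMatcher


def changed_line_ranges(a_lines, b_lines):
    """Return (tag, i1, i2, j1, j2) for every non-equal opcode.

    a_lines[i1:i2] is the affected range in the first document and
    b_lines[j1:j2] is the corresponding range in the second.
    """
    # Comparing lists of lines (not single characters) keeps the
    # sequences short, which is what makes this approach fast.
    matcher = SequenceMatcher(None, a_lines, b_lines)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical sample documents, split into lines with endings kept.
old = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n".splitlines(keepends=True)
new = "<html>\n<body>\n<p>Goodbye</p>\n<p>World</p>\n</body>\n</html>\n".splitlines(keepends=True)

for tag, i1, i2, j1, j2 in changed_line_ranges(old, new):
    print(tag, "old lines", (i1, i2), "new lines", (j1, j2))
```

The indices are 0-based line numbers with exclusive end points, so they can be used directly as slices into the two line lists.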
2

Google has a diff library for plain text with a Python API, which should apply to the HTML documents you want to work with. I am not sure whether it suits your particular use case, where you are specifically interested in the location of the differences, but it is worth a look.

Raja Selvaraj
  • Looks promising, but their wiki page does warn that "The diff, match and patch algorithms in this library are plain text only. Attempting to feed HTML, XML or some other structured content through them may result in problems." – hoju Jan 04 '10 at 22:07
  • How can I use this API with Python? It would be great if it could be illustrated with an example. – qre0ct May 15 '13 at 04:45
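
In response to the request for an example, here is a rough sketch. It assumes the library in question is diff-match-patch, installed from PyPI as `diff-match-patch` and exposing a `diff_match_patch` class with `diff_main` and `diff_cleanupSemantic` methods. The helper names diff_positions and dmp_positions are hypothetical and not part of the library. Since the question asks for positions, the helper turns the library's (op, text) pairs into character offsets.

```python
def diff_positions(diffs):
    """Convert diff-match-patch style (op, text) pairs into positions.

    op is -1 (delete), 0 (equal) or 1 (insert). Returns a list of
    (op, offset_in_old, offset_in_new, length) for each changed chunk.
    """
    positions = []
    old_pos = new_pos = 0
    for op, text in diffs:
        if op == 0:
            # Unchanged text advances both offsets.
            old_pos += len(text)
            new_pos += len(text)
        elif op == -1:
            # Text present only in the old document.
            positions.append((op, old_pos, new_pos, len(text)))
            old_pos += len(text)
        else:
            # Text present only in the new document.
            positions.append((op, old_pos, new_pos, len(text)))
            new_pos += len(text)
    return positions


def dmp_positions(old_text, new_text):
    # Assumption: the third-party package is installed
    # (pip install diff-match-patch). It is imported here so that
    # diff_positions() above stays usable without it.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main(old_text, new_text)
    dmp.diff_cleanupSemantic(diffs)  # merge tiny coincidental matches
    return diff_positions(diffs)
```

Calling dmp_positions(old_text, new_text) on two strings would then give the offsets of each deleted or inserted chunk. The warning quoted above still applies: the library treats its input as plain text, so changes inside HTML tags are reported like any other characters.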
1

An ugly and stupid solution: if diff is faster, use it. Call it from Python via subprocess and parse the command output for the information you need. This won't be as fast as diff alone, but it may be faster than difflib.
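
A minimal sketch of that idea, assuming a diff executable is on the PATH and that it emits the traditional "normal" output format (hunk headers such as 3c3, 5,7d4 or 8a9,10). The function names parse_diff_output and diff_files are made up for illustration.

```python
import re
import subprocess

# Matches normal-format hunk headers such as "3c3", "5,7d4" or "8a9,10".
_HUNK = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_diff_output(output):
    """Return (kind, old_start, old_end, new_start, new_end) per hunk.

    kind is "a" (add), "c" (change) or "d" (delete). Line numbers are
    1-based and inclusive, exactly as printed by diff.
    """
    hunks = []
    for line in output.splitlines():
        m = _HUNK.match(line)
        if m:
            old_start, old_end, kind, new_start, new_end = m.groups()
            hunks.append((
                kind,
                int(old_start), int(old_end or old_start),
                int(new_start), int(new_end or new_start),
            ))
    return hunks


def diff_files(path_a, path_b):
    # diff exits with 0 (identical), 1 (different) or >1 (trouble).
    result = subprocess.run(["diff", path_a, path_b],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return parse_diff_output(result.stdout)
```

Only the hunk headers are parsed; the content lines starting with "<" or ">" are skipped, which suits the case where the positions matter more than the text.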

miku