Determine where documents differ with Python

Question

I have been using Python's difflib library to find where two documents differ. The Differ().compare() method does this, but it is very slow: at least 100x slower than the diff command for large HTML documents.

How can I efficiently determine where two documents differ in Python? (Ideally I am after the positions of the differences rather than the actual text; positions are what SequenceMatcher().get_opcodes() returns.)

Kimvais · Accepted Answer · 2010-01-05T08:50:02.683

3

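# Compare two files line by line and report every line that differs.
# Binary mode makes len() count bytes, so pos is a true byte offset.
with open("file1.txt", "rb") as fa, open("file2.txt", "rb") as fb:
    pos = 0  # byte offset of the current line within file1.txt
    # zip() stops at the end of the shorter file
    for count, (al, bl) in enumerate(zip(fa, fb), start=1):
        if al != bl:
            print("files differ on line %d, byte %d" % (count, pos))
        pos += len(al)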

edited Jan 05 '10 at 08:50

answered Jan 04 '10 at 12:30

Kimvais


Comparing by lines instead of by characters, which is what I was doing, is a good idea. When I changed Differ to use lines instead of characters, the efficiency became comparable to the diff command! – hoju Jan 05 '10 at 09:46
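
A minimal sketch of the line-based approach described in the comment above, using SequenceMatcher().get_opcodes() from the question to get positions rather than text. The helper name changed_line_ranges and the sample lines are made up for illustration; autojunk=False is an assumption that you want every line considered, at some cost in speed on very large inputs.

```python
from difflib import SequenceMatcher


def changed_line_ranges(a_lines, b_lines):
    """Return (tag, i1, i2, j1, j2) for every non-equal opcode.

    a_lines[i1:i2] is the affected range in the first document and
    b_lines[j1:j2] the corresponding range in the second one.
    Indices are 0-based line numbers.
    """
    # Passing lists of lines (not single strings) makes difflib compare
    # whole lines instead of individual characters.
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical sample data; in practice use open(path).readlines().
old = ["<html>\n", "<p>one</p>\n", "<p>two</p>\n", "</html>\n"]
new = ["<html>\n", "<p>one</p>\n", "<p>2</p>\n", "<p>three</p>\n", "</html>\n"]

for tag, i1, i2, j1, j2 in changed_line_ranges(old, new):
    print("%s: a[%d:%d] -> b[%d:%d]" % (tag, i1, i2, j1, j2))
```

The tags are "replace", "delete" and "insert", so unlike a plain line-by-line comparison this keeps the two documents aligned after lines are added or removed.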

score 2 · Answer 2 · answered Jan 04 '10 at 13:13

Google has a diff library for plain text with a Python API, which should apply to the HTML documents you want to work with. I am not sure whether it suits your particular use case, where you are specifically interested in the location of the differences, but it is worth a look.

Raja Selvaraj


Looks promising, but their wiki page does warn that "The diff, match and patch algorithms in this library are plain text only. Attempting to feed HTML, XML or some other structured content through them may result in problems." – hoju Jan 04 '10 at 22:07
How can I use this API with Python? It would be great if it could be illustrated with an example. – qre0ct May 15 '13 at 04:45

score 1 · Answer 3 · answered Jan 04 '10 at 12:18

An ugly and stupid solution: if diff is faster, use it. Call it from Python via subprocess and parse the command output for the information you need. This won't be as fast as running diff on its own, but it may be faster than difflib.
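
A rough sketch of that idea, assuming a diff executable on the PATH that supports unified output (-U 0, meaning no context lines). The names parse_hunks and diff_positions are hypothetical, and only the hunk headers are parsed, since those carry the positions.

```python
import re
import subprocess

# Matches unified-diff hunk headers such as "@@ -3,2 +3,4 @@".
# A missing length (as in "@@ -3 +3 @@") means a length of 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunks(diff_output):
    """Return (old_start, old_len, new_start, new_len) for each hunk.

    Start values are 1-based line numbers as reported by diff.
    """
    hunks = []
    for line in diff_output.splitlines():
        m = HUNK_RE.match(line)
        if m:
            old_start, old_len, new_start, new_len = m.groups()
            hunks.append((
                int(old_start),
                1 if old_len is None else int(old_len),
                int(new_start),
                1 if new_len is None else int(new_len),
            ))
    return hunks


def diff_positions(path_a, path_b):
    """Run diff on two files and return the parsed hunk positions."""
    proc = subprocess.run(
        ["diff", "-U", "0", path_a, path_b],
        capture_output=True,
        text=True,
    )
    # diff exits with 0 if the files match, 1 if they differ,
    # and a larger value if something went wrong.
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr)
    return parse_hunks(proc.stdout)
```

Calling diff_positions("file1.txt", "file2.txt") would then return one tuple per changed region, which can be mapped to byte offsets if needed.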

miku

