HTML structure diff in Python

Question

I want to diff html files by structure and not by content. For example: b and a are identical with this diff because the structures of them are equal.

Anyone knows tool (I prefer in python) or implementation do it ?

I tried to parse the tree by beautifulSoup and then use zhang shasha algorithm but it's so slow... — Horesh, Oct 19 '15 at 03:55

score 0 · Answer 1 · answered Oct 18 '15 at 23:34

You need to parse the HTML/XMLto a DOM tree and then compare those trees. The preferred solution for parsin in Python for this is lxml library. For comparison I am not sure any lib exist but below is a guidelining source code.

Here is one XML comparison function from Ian Bicking (orignal source, under Python Software Foundation License, https://bitbucket.org/ianb/formencode/src/tip/formencode/doctest_xml_compare.py?fileviewer=file-view-default#doctest_xml_compare.py-70 )

try:
    import doctest
    doctest.OutputChecker
except AttributeError: # Python < 2.4
    import util.doctest24 as doctest
try:
    import xml.etree.ElementTree as ET
except ImportError:
    import elementtree.ElementTree as ET
from xml.parsers.expat import ExpatError as XMLParseError

RealOutputChecker = doctest.OutputChecker


def debug(*msg):
    import sys
    print >> sys.stderr, ' '.join(map(str, msg))


class HTMLOutputChecker(RealOutputChecker):

    def check_output(self, want, got, optionflags):
        normal = RealOutputChecker.check_output(self, want, got, optionflags)
        if normal or not got:
            return normal
        try:
            want_xml = make_xml(want)
        except XMLParseError:
            pass
        else:
            try:
                got_xml = make_xml(got)
            except XMLParseError:
                pass
            else:
                if xml_compare(want_xml, got_xml):
                    return True
        return False

    def output_difference(self, example, got, optionflags):
        actual = RealOutputChecker.output_difference(
            self, example, got, optionflags)
        want_xml = got_xml = None
        try:
            want_xml = make_xml(example.want)
            want_norm = make_string(want_xml)
        except XMLParseError, e:
            if example.want.startswith('<'):
                want_norm = '(bad XML: %s)' % e
                #  '<xml>%s</xml>' % example.want
            else:
                return actual
        try:
            got_xml = make_xml(got)
            got_norm = make_string(got_xml)
        except XMLParseError, e:
            if example.want.startswith('<'):
                got_norm = '(bad XML: %s)' % e
            else:
                return actual
        s = '%s\nXML Wanted: %s\nXML Got   : %s\n' % (
            actual, want_norm, got_norm)
        if got_xml and want_xml:
            result = []
            xml_compare(want_xml, got_xml, result.append)
            s += 'Difference report:\n%s\n' % '\n'.join(result)
        return s


def xml_compare(x1, x2, reporter=None):
    if x1.tag != x2.tag:
        if reporter:
            reporter('Tags do not match: %s and %s' % (x1.tag, x2.tag))
        return False
    for name, value in x1.attrib.items():
        if x2.attrib.get(name) != value:
            if reporter:
                reporter('Attributes do not match: %s=%r, %s=%r'
                         % (name, value, name, x2.attrib.get(name)))
            return False
    for name in x2.attrib.keys():
        if name not in x1.attrib:
            if reporter:
                reporter('x2 has an attribute x1 is missing: %s'
                         % name)
            return False
    if not text_compare(x1.text, x2.text):
        if reporter:
            reporter('text: %r != %r' % (x1.text, x2.text))
        return False
    if not text_compare(x1.tail, x2.tail):
        if reporter:
            reporter('tail: %r != %r' % (x1.tail, x2.tail))
        return False
    cl1 = x1.getchildren()
    cl2 = x2.getchildren()
    if len(cl1) != len(cl2):
        if reporter:
            reporter('children length differs, %i != %i'
                     % (len(cl1), len(cl2)))
        return False
    i = 0
    for c1, c2 in zip(cl1, cl2):
        i += 1
        if not xml_compare(c1, c2, reporter=reporter):
            if reporter:
                reporter('children %i do not match: %s'
                         % (i, c1.tag))
            return False
    return True


def text_compare(t1, t2):
    if not t1 and not t2:
        return True
    if t1 == '*' or t2 == '*':
        return True
    return (t1 or '').strip() == (t2 or '').strip()


def make_xml(s):
    return ET.XML('<xml>%s</xml>' % s)


def make_string(xml):
    if isinstance(xml, (str, unicode)):
        xml = make_xml(xml)
    s = ET.tostring(xml)
    if s == '<xml />':
        return ''
    assert s.startswith('<xml>') and s.endswith('</xml>'), repr(s)
    return s[5:-6]


def install():
    doctest.OutputChecker = HTMLOutputChecker

This is pretty nice, but is it possible to make this shorter? — interestinglythere, Oct 19 '15 at 00:11
Making it shorter is left exercise for the reader. You are asking an answer for non-trivial problem :) — Mikko Ohtamaa, Oct 19 '15 at 00:22

interestinglythere · Answer 2 · 2015-10-19T00:15:23.957

Sidenote: <\head> is not a valid HTML tag and will be interpreted as text. HTML close tags look like this: </head>

As other answerers may tell you, using a library that actually knows what a DOM is is probably the most reliable option if you're comparing well-structured, complete HTML documents or fragments. A simpler solution than using a DOM is to use regex to match HTML tags.

It's simple (can be done in two lines).
It's reliable in everything I've tested so far, but can give unexpected results when, for example, HTML tags appear in <pre> or <textarea> elements.
Will work with partial HTML fragments like </head>, while DOM/parsing libraries might complain that a <head> tag is missing.

Demo
Following is some code that normalizes HTML input (the HTML of this page, actually) by finding all the tags and printing them in succession.

import re, urllib
f = urllib.urlopen('http://stackoverflow.com/questions/33204018/html-structure-diff-in-python')
html = f.read()
for m in re.finditer(r'''</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''', html):
    print m.group(0)

You can take the output from the above and use whatever command-line diff tool you prefer to compare them.

Or maybe you want to compare them using Python. Instead of printing out all the lines, you might be interested in concatenating them into a single string:

tags_as_string = ''
for m in re.finditer(r'''</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''', html):
    s += m.group(0) + '\n'  # the newline makes diff output look nicer

or list:

tags_as_list = []
for m in re.finditer(r'''</?(\w+)((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>''', html):
    s.append(m.group(0))

Further steps to consider (can be done inside the for loop):

Perhaps you're only interested in the tag name and not the attributes. The tag name can be accessed with m.group(1) (the first regex group in parentheses) in the for-loop.
Tags that mean the same thing still might be different due to whitespace. You might want to normalize out the whitespace within each tag using a similar technique.

Credit: The actual regex is from http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/

It is **strongly** recommended not doing **any** regular expressions on HTML. For further information please see this answer http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Mikko Ohtamaa, Oct 19 '15 at 00:21
Thanks, I've seen that answer many times before, but it doesn't tell me why. This answer http://stackoverflow.com/a/590789/782045 tells us why: "it depends on matching the opening and the closing tag which is not possible with regexps." Turns out, matching the opening and closing tags is not needed for this question. I believe the tags themselves (no matching) indeed do form a natural language, which regex is perfect for. — interestinglythere, Oct 19 '15 at 10:47
Because HTML is inheritantly compex and regexs is not sufficient for it. Pitfall cases includes e.g. tag-like strings in inline JavaScript, tag-like strings in inline CSS, CDATA sections, HTML entity encoding. — Mikko Ohtamaa, Oct 19 '15 at 17:45
I'm starting to understand what you mean – perhaps regex is not best suited for this because it will be very long. However, each of these things you described does not require us to understand nesting levels (none of those things can be nested inside themselves), which means that the language is still regular and can be processed with regex. — interestinglythere, Oct 19 '15 at 21:29

HTML structure diff in Python

2 Answers2