9

I want to use Python to diff two HTML files.

Example:

html_1 = """
<p>i love it</p>
"""
html_2 = """ 
<h2>i love it </p>
"""

The diff output should look like this:

diff_html = """
<del><p>i love it</p></del><ins><h2>i love it</h2></ins>
"""

Is there a Python library that can help me do this?

mike
  • possible duplicate of [Generate pretty diff html in Python](http://stackoverflow.com/questions/1576459/generate-pretty-diff-html-in-python) – bummi Apr 26 '15 at 06:00

6 Answers

13

lxml can do something similar to what you want. From the docs:

>>> from lxml.html.diff import htmldiff
>>> doc1 = '''<p>Here is some text.</p>'''
>>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
>>> print(htmldiff(doc1, doc2))
<p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>

I don't know of any other Python library for this specific task, but you may want to look into word-by-word diffs. They may approximate what you want.

One example is this one, implemented in both PHP and Python (save it as diff.py, then import diff)

>>> diff.htmlDiff(a,b)
'<del><p>i</del> <ins><h2>i</ins> love <del>it</p></del> <ins>it </p></ins>'
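A word-by-word diff of this kind can also be sketched with only the standard library. The following is a minimal illustration, not the linked script: the function name `word_diff_html` is made up for this example, and it assumes that treating the markup as plain text (tags are neither parsed nor escaped) is acceptable.

```python
import difflib
import re


def word_diff_html(old, new):
    """Wrap removed words in <del> and added words in <ins> (hypothetical helper)."""
    # Split into runs of whitespace and runs of non-whitespace so spacing survives.
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(old_tokens[i1:i2]))
            continue
        # A "replace" opcode yields both a deletion and an insertion.
        if tag in ("delete", "replace"):
            out.append("<del>%s</del>" % "".join(old_tokens[i1:i2]))
        if tag in ("insert", "replace"):
            out.append("<ins>%s</ins>" % "".join(new_tokens[j1:j2]))
    return "".join(out)


print(word_diff_html("<p>i love it</p>", "<h2>i love it</h2>"))
# <del><p>i</del><ins><h2>i</ins> love <del>it</p></del><ins>it</h2></ins>
```

Because tags are glued to the neighbouring words, the result is not well-formed HTML; it only marks which whitespace-separated tokens changed.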
Eduardo Ivanec
  • I tried it, but what I got is `>>> from lxml.html.diff import htmldiff >>> doc1 = ''' Here is some text. ''' >>> doc2 = ''' Here is some text. ''' >>> print htmldiff(doc1, doc2) Here is some text.` – mike Mar 05 '12 at 06:17
  • My p tag was changed to h2, but it didn't show the difference. – mike Mar 05 '12 at 06:20
  • The lxml.html.diff documentation says: **Markup is generally ignored,** with the markup from new_html preserved, and possibly some markup from old_html (though it is considered acceptable to lose some of the old markup). Only the words in the HTML are diffed. – mike Mar 05 '12 at 06:29
  • You're right, it works much worse than I remembered. I added an alternative, it's not exactly what you want but may be of help. – Eduardo Ivanec Mar 05 '12 at 13:09
  • That's nice! If it works for you you should post it as an answer and accept it. – Eduardo Ivanec Mar 05 '12 at 14:09
2

Check out diff2HtmlCompare (full disclosure: I'm the author). If you just want to visualize the differences, this may help you. If you want to extract the differences and do something with them, you can use difflib as suggested by others (the script above just wraps difflib and uses pygments for syntax highlighting). Doug Hellmann has done a pretty good job detailing how to use difflib; I'd suggest checking out his tutorial.
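For the "extract the differences" case, here is a rough sketch using `difflib.unified_diff` from the standard library. It is not taken from diff2HtmlCompare; the labels `old.html` and `new.html` and the sample strings are placeholders chosen for this example.

```python
import difflib

html_1 = "<p>i love it</p>\n"
html_2 = "<h2>i love it</h2>\n"

diff_lines = list(difflib.unified_diff(
    html_1.splitlines(keepends=True),
    html_2.splitlines(keepends=True),
    fromfile="old.html",  # placeholder label, not a real file
    tofile="new.html",    # placeholder label, not a real file
))

# The first two lines are the "---" / "+++" headers; skip them before filtering.
body = diff_lines[2:]
removed = [line[1:].rstrip("\n") for line in body if line.startswith("-")]
added = [line[1:].rstrip("\n") for line in body if line.startswith("+")]

print(removed)  # ['<p>i love it</p>']
print(added)    # ['<h2>i love it</h2>']
```

The two lists can then be fed into whatever post-processing you need, for example wrapping them in `<del>` and `<ins>`.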

wagoodman
2

I found two helpful Python libraries:

  1. htmltreediff
  2. htmldiff

However, both of them use Python's difflib to diff the text, and I want to use Google's diff.

mike
1

You could use difflib.ndiff() to look for and replace the "-"/"+" with your desired HTML.

import difflib

html_1 = """
<p>i love it</p>
"""
html_2 = """
<h2>i love it </p>
"""

diff_html = ""
theDiffs = difflib.ndiff(html_1.splitlines(), html_2.splitlines())
for eachDiff in theDiffs:
    if (eachDiff[0] == "-"):
        diff_html += "<del>%s</del>" % eachDiff[1:].strip()
    elif (eachDiff[0] == "+"):
        diff_html += "<ins>%s</ins>" % eachDiff[1:].strip()

print(diff_html)

The result:

<del><p>i love it</p></del><ins><h2>i love it </p></ins>
Nate
0

AFAIK, Python has a built-in difflib module that can do this.

HunnyBear
Snakes and Coffee
0

This isn't exactly the output you asked for, but the standard library's difflib includes a simple HtmlDiff class that builds an HTML diff table for you.

import difflib

html_1 = """
<p>i love it</p>
"""
html_2 = """ 
<h2>i love it </p>
"""

htmldiff = difflib.HtmlDiff()
html_table = htmldiff.make_table([html_1], [html_2]) # each item is a list of lines
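If you want a complete page rather than just the table, `HtmlDiff.make_file()` wraps the same table in a full HTML document with its own styles. A small sketch, assuming you split each input into lines first; the output file name `diff.html` and the temp-directory location are arbitrary choices for this example.

```python
import difflib
import os
import tempfile

html_1 = """
<p>i love it</p>
"""
html_2 = """
<h2>i love it </p>
"""

# make_file() expects two lists of lines and returns a complete HTML page.
page = difflib.HtmlDiff().make_file(
    html_1.splitlines(),
    html_2.splitlines(),
    fromdesc="html_1",
    todesc="html_2",
)

# Arbitrary output location for this example.
out_path = os.path.join(tempfile.gettempdir(), "diff.html")
with open(out_path, "w", encoding="utf-8") as handle:
    handle.write(page)

print(out_path)
```

Note that HtmlDiff escapes the input, so the tags are shown as text in a side-by-side table instead of being wrapped in `<del>`/`<ins>`.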
Nate
monkut