35

I have two chunks of text that I would like to compare and see which words/lines have been added/removed/modified in Python (similar to a Wiki's Diff Output).

I have tried difflib.HtmlDiff but its output is less than pretty.

Is there a way in Python (or external library) that would generate clean looking HTML of the diff of two sets of text chunks? (not just line level, but also word/character modifications within a line)

The Unknown

7 Answers

33

There's diff_prettyHtml() in the diff-match-patch library from Google.

Zach Young
tonfa
  • The .zip download link now gives a 404 :( – Richard H Oct 23 '16 at 19:43
  • 2
    It's hard to tell if there's a way to generate a good side-by-side diff of multiple-line files with diff-match-patch. It seems mostly focused on character-level comparison, and the documentation on line-level is not very good (and the example is only in JavaScript). – aldel Jan 30 '20 at 19:39
  • 7
    Also I think its new home is here: https://github.com/google/diff-match-patch – aldel Jan 30 '20 at 19:39
26

Generally, if you want some HTML to render in a prettier way, you do it by adding CSS.

For instance, if you generate the HTML like this:

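import difflib
import sys

fromfile = "xxx"
tofile = "zzz"

# Read both files as lists of lines. The old 'U' mode was removed in
# Python 3.11; universal newlines are already the default in text mode.
with open(fromfile) as f:
    fromlines = f.readlines()
with open(tofile) as f:
    tolines = f.readlines()

# make_file() returns one string holding a complete HTML page.
diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)

sys.stdout.write(diff)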

then you get green backgrounds on added lines, yellow on changed lines and red on deleted ones. If I were doing this, I would take the generated HTML, extract the body, and prefix it with my own handwritten block of HTML with lots of CSS to make it look good. I'd also probably strip out the legend table and either move it to the top or wrap it in a div so that CSS can position it.

Actually, I would give serious consideration to just fixing up the difflib module (which is written in Python) so that it generates better HTML, and contributing the change back to the project. If you have a CSS expert to help you or are one yourself, please consider doing this.
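
As a rough illustration of the "own CSS" idea above, here is a minimal sketch that uses HtmlDiff.make_table() to get only the diff table and wraps it in a hand-written page. The class names (diff, diff_header, diff_add, diff_chg, diff_sub) are the ones difflib emits; the stylesheet, the colours and the render_diff_page helper are assumptions made up for this example, not part of difflib.

```python
import difflib
import html

# Hypothetical stylesheet: only the class names come from difflib,
# the colours and fonts are arbitrary choices for this sketch.
CUSTOM_CSS = """
table.diff { font-family: monospace; border-collapse: collapse; }
.diff_header { background: #f0f0f0; color: #888; padding: 0 4px; }
.diff_add { background: #d4fcbc; }
.diff_chg { background: #fff3b0; }
.diff_sub { background: #fbb6c2; }
"""


def render_diff_page(from_text, to_text, title="Diff"):
    """Return a full HTML page that wraps difflib's table in our own CSS."""
    # make_table() returns just the <table>, without difflib's default
    # page, stylesheet or legend.
    table = difflib.HtmlDiff().make_table(
        from_text.splitlines(),
        to_text.splitlines(),
        fromdesc="before",
        todesc="after",
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title>"
        f"<style>{CUSTOM_CSS}</style></head>\n"
        f"<body>{table}</body></html>"
    )


page = render_diff_page("one\ntwo\nthree\n", "one\n2\nthree\nfour\n")
```

Because the legend is never generated this way, there is nothing to strip out afterwards; if you want one, you would have to write it yourself.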

Michael Dillon
  • 3
    Someone implemented your proposal (as I often find is the case with Python). HtmlDiff has a make_table() method that creates just the HTML table, so the user can add their own CSS to prettify it. Unlike the accepted answer, this is included in the standard library (since Python 2.4). – Peter M. - stands for Monica Jan 14 '16 at 17:18
  • Unfortunately the HTML generated by `difflib.HtmlDiff` is a pretty archaic table format that isn't well suited to customization with CSS. But it still works pretty well, if you don't need a lot of customization. You can probably change colors and fonts, but that's about it. The big secret that I almost missed is the `wrapcolumn` argument to the constructor, which lets you prevent the table from being arbitrarily wide. – aldel Jan 30 '20 at 19:43
  • 1
    This process shows the ENTIRE file side by side even if only ONE LINE HAS CHANGED. This is a problem if the file is large. Not sure if there's a way to fix this – Gh0sT Jun 14 '21 at 08:44
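
For the two points raised in the comments above (very wide tables and whole-file output), a minimal sketch that may help: the `wrapcolumn` constructor argument wraps long lines, and the `context` / `numlines` arguments of make_table() (also accepted by make_file()) limit the output to the changed lines plus a few surrounding ones. The sample data below is invented for the illustration.

```python
import difflib

# Invented sample data: 200 lines, exactly one of which changes.
old = [f"line {i}" for i in range(1, 201)]
new = list(old)
new[99] = "line 100 (edited)"

# wrapcolumn keeps each side of the table from growing arbitrarily wide.
differ = difflib.HtmlDiff(wrapcolumn=80)

# Default: every line of both inputs is rendered.
full = differ.make_table(old, new)

# context=True renders only the changes plus `numlines` lines around them.
trimmed = differ.make_table(old, new, context=True, numlines=2)

print(full.count("<tr>"), "table rows without context")
print(trimmed.count("<tr>"), "table rows with context=True, numlines=2")
```

The trimmed table should be far shorter than the full one, which is usually what you want for large files with few changes.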
6

I recently posted a Python script that does just this: diff2HtmlCompare (follow the link for a screenshot). Under the hood it wraps difflib and uses pygments for syntax highlighting.

wagoodman
1

not just line level, but also word/character modifications within a line

xmldiff seems to be a nice package for this purpose, especially when you have XML/HTML to compare. Read more in its documentation.

yofee
0

First clean up both HTML documents with lxml.html, and then check the difference with difflib.

Oduvan
0

Since the .. library from Google seems to have no active development any more, I suggest using diff_py

From the github page:

The simple diff tool which is written by Python. The diff result can be printed in console or to html file.

guettli
-1

A copy of my own answer from here.


What about DaisyDiff (Java and PHP versions available)?

The following features are really nice:

  • Works with badly formed HTML that can be found "in the wild".
  • The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
  • In addition to the default visual diff, HTML source can be diffed coherently.
  • Provides easy to understand descriptions of the changes.
  • The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
Community
elhoim