11

I am using google-diff-match-patch to implement a real-time editing tool that synchronizes text between multiple users. Everything works well when the content is plain text: each user's operation (adding or deleting text) can be diffed out by comparing against the old text snapshot with the help of google-diff. But when rich-format text (such as bold or italic) is involved, google-diff does not work well on the HTML string. The < and > characters mess up the diff results, especially when bold and italic formatting are nested within each other.

Could anyone suggest a library similar to google-diff that can diff HTML strings? Or is there a way to fix my problem with google-diff itself? I understand that google-diff is designed for plain text, but I haven't found a better library so far, so a workable enhancement to google-diff would also help.

Volker Siegel
Steve

5 Answers

9

The wiki at the google-diff-match-patch project shares some ideas. From http://code.google.com/p/google-diff-match-patch/wiki/Plaintext :

One method is to strip the tags from the HTML using a simple regex or node-walker. Then diff the HTML content against the text content. Don't perform any diff cleanups. This diff enables one to map character positions from one version to the other (see the diff_xIndex function). After this, one can apply all the patches one wants against the plain text, then safely map the changes back to the HTML. The catch with this technique is that although text may be freely edited, HTML tags are immutable.

Another method is to walk the HTML and replace every opening and closing tag with a Unicode character. Check the Unicode spec for a range that is not in use. During the process, create a hash table of Unicode characters to the original tags. The result is a block of text which can be patched without fear of inserting text inside a tag or breaking the syntax of a tag. One just has to be careful when reconverting the content back to HTML that no closing tags are lost.

I have a hunch that the 2nd idea, map-HTML-tags-to-Unicode-placeholders, might work better than one would otherwise guess... especially if your HTML tags are from some reduced set, and if you can perform a little open/close touchup when displaying interleaved (strikethrough/underlined) diff markup.
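As a rough illustration of the placeholder idea, here is a minimal sketch in plain JavaScript. The helper names `encodeTags` and `decodeTags` are made up for this example and are not part of diff-match-patch. Using the Private Use Area (U+E000 to U+F8FF) as the "range that is not in use" is an assumption; it only holds if your content never contains those characters itself.

```javascript
// Hypothetical helpers for illustration; not part of diff-match-patch.
// Assumption: the Private Use Area is free in your content.
const PUA_START = 0xe000;

// Replace every opening/closing tag with a single placeholder character.
// Pass the same maps for the old and new versions so that identical tags
// always map to identical placeholders.
function encodeTags(html, tagToChar = new Map(), charToTag = new Map()) {
  const text = html.replace(/<\/?[a-zA-Z][^>]*>/g, (tag) => {
    if (!tagToChar.has(tag)) {
      const ch = String.fromCharCode(PUA_START + tagToChar.size);
      tagToChar.set(tag, ch);
      charToTag.set(ch, tag);
    }
    return tagToChar.get(tag);
  });
  return { text, tagToChar, charToTag };
}

// Turn placeholder characters back into the tags they stand for.
// Unknown placeholder characters are left untouched.
function decodeTags(text, charToTag) {
  return text.replace(/[\uE000-\uF8FF]/g, (ch) =>
    charToTag.has(ch) ? charToTag.get(ch) : ch
  );
}

// Usage: encode, then diff/patch the encoded strings, then decode.
const encoded = encodeTags("a <b>bold <i>x</i></b>");
const roundTrip = decodeTags(encoded.text, encoded.charToTag);
```

The encoded strings can then be handed to the diff/patch functions as ordinary text. Rebalancing any closing tags that get lost during patching is not handled by this sketch and would still need the touchup step mentioned above.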

Another method that might work with simple styling would be to remove the HTML tags, but remember the character indexes affected. For example, "positions 8-15 are bolded". Then, perform a plaintext diff. Finally, using the diff_xIndex position-mapping idea from the wiki's first method, intelligently re-insert HTML tags to reapply stylings to the ranges that survived or were added. (That is, if old positions 8-13 survived, but moved to 20-25, insert the B tags around there.)
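A minimal sketch of that last idea, assuming simple, well-nested tags without attributes that matter. All function names here (`extractRanges`, `mapIndex`, `applyRanges`) are hypothetical. `mapIndex` is a standalone re-implementation of the position mapping that `diff_xIndex` performs, and the `diffs` array is hand-written in the library's `[operation, text]` format (-1 delete, 0 equal, 1 insert) so the example runs without the library; in practice it would come from the plaintext diff.

```javascript
// Hypothetical helpers for illustration only.

// Split simple, well-nested HTML into plain text plus styled ranges.
function extractRanges(html) {
  const ranges = []; // { tag, start, end }, offsets into the plain text
  const open = [];   // stack of currently open tags
  let text = "";
  const re = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    if (m[3] !== undefined) {
      text += m[3];
    } else if (m[1] === "") {
      open.push({ tag: m[2].toLowerCase(), start: text.length });
    } else {
      const top = open.pop();
      if (top) ranges.push({ tag: top.tag, start: top.start, end: text.length });
    }
  }
  return { text, ranges };
}

// Map an offset in the old text to the equivalent offset in the new text.
function mapIndex(diffs, loc) {
  let chars1 = 0, chars2 = 0, last1 = 0, last2 = 0, i;
  for (i = 0; i < diffs.length; i++) {
    if (diffs[i][0] !== 1) chars1 += diffs[i][1].length;  // equal or delete
    if (diffs[i][0] !== -1) chars2 += diffs[i][1].length; // equal or insert
    if (chars1 > loc) break;
    last1 = chars1;
    last2 = chars2;
  }
  if (i < diffs.length && diffs[i][0] === -1) return last2; // offset was deleted
  return last2 + (loc - last1);
}

// Re-insert tags around the (already mapped) ranges; empty ranges are dropped.
function applyRanges(text, ranges) {
  const events = [];
  ranges.forEach((r, idx) => {
    if (r.end <= r.start) return;
    const len = r.end - r.start;
    events.push({ pos: r.start, close: false, len, idx, tag: r.tag });
    events.push({ pos: r.end, close: true, len, idx, tag: r.tag });
  });
  // Same position: closes first; outer tags open first, inner tags close first.
  events.sort((a, b) =>
    a.pos - b.pos ||
    b.close - a.close ||
    (a.close ? a.len - b.len : b.len - a.len) ||
    (a.close ? a.idx - b.idx : b.idx - a.idx)
  );
  let out = "", last = 0;
  for (const e of events) {
    out += text.slice(last, e.pos) + (e.close ? `</${e.tag}>` : `<${e.tag}>`);
    last = e.pos;
  }
  return out + text.slice(last);
}

// Usage: "brave" was bold in the old version; the new plain text moved it.
const old = extractRanges("Hello <b>brave</b> world");
const diffs = [[1, "Oh, "], [0, "Hello brave "], [1, "new "], [0, "world"]];
const moved = old.ranges.map((r) => ({
  tag: r.tag,
  start: mapIndex(diffs, r.start),
  end: mapIndex(diffs, r.end),
}));
const restyled = applyRanges("Oh, Hello brave new world", moved);
// restyled is "Oh, Hello <b>brave</b> new world"
```

Because the mapping is monotonic, ranges that were properly nested before the edit stay properly nested afterwards. What happens to text inserted exactly at a range boundary is a policy choice this sketch does not try to settle.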

Community
gojomo
  • And what about this: escape the html characters (<, >, &), do all the diff/patch/merge work and unescape the result. Seems to be the stablest solution to me. – ayke Jan 25 '12 at 12:31
  • I think you'd find that approach would result in the exact same output as not-escaping them. The diffing algorithm doesn't have any problem treating them like any other character; the problem is keeping them balanced, and escaping them doesn't address that. – gojomo Jan 25 '12 at 18:20
  • I went through this and ended up creating a wrapper library to help with the "presentation work" needed to use `diff_match_patch`: https://github.com/arnab/jQuery.PrettyTextDiff – or9ob Jan 24 '13 at 09:42
  • @arnab - FYI your jsfiddle demo isn't working in FF19/Mac (but does in Chrome23/Mac). – gojomo Jan 31 '13 at 05:07
  • @gojomo: Thanks for the note. Just checked out FF19 on Mac (OSX 10.8.2) - worked fine. What error do you get (maybe in console)? – or9ob Feb 01 '13 at 06:06
  • @arnab: Still having problem; OSX/10.7.3, FF/19.0 "up to date" "on the beta channel". After pressing 'Diff' button, error in console is `TypeError: $(...).prettyTextDiff is not a function http://fiddle.jshell.net/_display/ Line 38`. (There is an earlier console warning, on page load, about `getAttributeNode()` being deprecated, but that seems harmless.) – gojomo Feb 05 '13 at 19:17
  • Try now. I updated it about a month back (forgot to mention here). – or9ob Aug 01 '13 at 07:16
6

jsdifflib - A Javascript Visual Diff Tool & Library https://github.com/cemerick/jsdifflib

There's a demo here: http://cemerick.github.io/jsdifflib/demo.html

cemerick
user735002
2

Pretty Diff does everything you need, except that you will need to update the DOM handling so that the diff fires on the "onkeyup" event instead of on a button click.

http://prettydiff.com/

austincheney
0

Take a look at SynchroEdit, might be useful.

gamers2000
  • Gamers2000, thanks for the comment. I did try SynchroEdit, but neither the sandbox nor the dev version is working. Btw, I also put a question in your original "OT library question": are you also working with google-diff-match-patch? How do you use it with rich-format HTML strings? Thanks for any comments. – Steve Jan 27 '10 at 02:17
  • Hi Steve, I am working with diff-match-patch, but I'm using it to synchronize plain text. Also, I'm actually using MobWrite(http://code.google.com/p/google-mobwrite), which is an implementation of diff-match-patch. Sorry I can't be of much help! – gamers2000 Jan 27 '10 at 03:38
0

There is another popular library called JSDiff: https://github.com/kpdecker/jsdiff. It works with HTML content too. The only drawback is that, when diffing by line, it needs a newline at the end of each line to treat it as a separate line. Otherwise, all the HTML content will be treated as a single line.
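One way to work around the single-line problem is to break the HTML into lines yourself before handing it to a line-based diff. The helper below is a hypothetical pre-processing step, not something JSDiff provides: it puts every tag and every text run on its own line. It is only meant for producing diff input, since text that already contains newlines cannot be rejoined unambiguously.

```javascript
// Hypothetical helper: one tag or text run per line, each ending in "\n".
function htmlToLines(html) {
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  return tokens.map((token) => token + "\n").join("");
}

// Usage: feed the result to a line-based diff instead of the raw HTML.
const asLines = htmlToLines("<p>Hi <b>there</b></p>");
```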

srth12