3

When I go to (for example) review section here, I see these beautifully formatted text diffs:

text diff

Or the combined one:

combined text diff

Now, just how the hell is this done? I'd love to include this to my site, but I can't really figure out the algorithm. Is this documented somewhere? Is there an open-source implementation for this, preferably in PHP?

Thanks for any help.

cypher
  • 6,822
  • 4
  • 31
  • 48
  • 1
    I have a series of articles on that topic here: http://devdirective.com/post/91/creating-a-reusable-though-simple-diff-implementation-in-csharp-part-1 - The code involved is in C# but the theory should be relatively easy to implement in any programming language. – Lasse V. Karlsen Jun 05 '14 at 17:45
  • 1
    [Longest common subsequence](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem). Find the common of `text1` and `text2`, the rest is what's marked on red/green, depending if it's in the original or new version. – amit Jun 05 '14 at 17:48
  • @LasseV.Karlsen this looks easy indeed, thank you for that. – cypher Jun 05 '14 at 17:48
  • 1
    There are of course flaws in it... look where it says **work** becomes **to work**, when really there should have just been a **to** added – BenVlodgi Jun 05 '14 at 17:50
  • [Text difference algorithm](http://stackoverflow.com/q/145607) – Bernhard Barker Jun 05 '14 at 18:27
  • Asking how [so] itself works would be off topic (that's rather a [meta] question), but the posts I linked to are about diff algorithms in general. – Bernhard Barker Jun 05 '14 at 18:31

1 Answers1

3

From the images that you posted, and my own (albeit little experience) it seems that the website uses a modification of the longest common sub sequence algorithm. This explains why it never shows rearrangement / shuffling of words.

The first modification is that instead of thinking of alphabets as atomic units, they consider words as atomic units. (also punctuation)

Secondly, the algorithm is relatively naive, it points out that you crossed out "work" when you actually just inserted a to there. It seems to just mark discontinuities of any kind (insertions, deletions, modifications) and crosses out one word or the whole discontinuation portion.

Thirdly, everything in the second list not a part of the first list is marked in green.

Seems relatively easy to implement. Check out some tutorial on dynamic programming.

recursive
  • 83,943
  • 34
  • 151
  • 241
shebang
  • 413
  • 2
  • 8
  • Won't using words as atomic units have a bad output if i diff for example "program" to "reprogram"? It will count as two completely different strings, right? – cypher Jun 05 '14 at 18:26
  • @cypher Correct. It will count as different strings. If you don't use words as atomic units, words such as host and honest will also show host as an LCS which may not be acceptable. It depends on how you define the difference. In a line of thinking, changing "program" to "reprogram" is best represented by cutting out the whole word. This reduces clutter and improves readability. – shebang Jun 05 '14 at 18:37
  • The part of the site I want to include this is correctors checking over and improving user's articles, so we're talking mainly grammar and stylistic differences, in which case I don't think i should consider words atomic units (mainly for grammar purposes), but I'm certainly going to try both approaches. – cypher Jun 05 '14 at 18:43
  • @cypher I gave you the technique SO seems to be using. I'd recommend you to read on some other Edit Distance (wiki) metrics to see other approaches. – shebang Jun 05 '14 at 19:04