5

I have two pieces of text. I would like to make a word-based diff between them (like the Unix utility wdiff does) but with more information in the output (namely, the character position where the added/deleted word starts).

I need to do this in Java, so a plain textual output of the differences (like wdiff's) doesn't suit me: I would like to manipulate objects representing the differences.

Josh Lee
Mycol
  • see http://stackoverflow.com/questions/479654/java-library-for-free-text-diff – mdma May 24 '10 at 16:51
  • Thanks, but that's not what I'm looking for: I would like a word-based diff where I don't just get the output but can manipulate the data. What I have in mind is a Java object with these fields: add/delete; the string (word) added or deleted; the position of the add/delete in the first (or second) file. – Mycol May 24 '10 at 17:05
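
The object described in the last comment could look roughly like the sketch below. Everything here is illustrative: the names `WordDiff`, `Kind`, `Token`, and `tokenize` are hypothetical, and treating every whitespace-separated run as a "word" is an assumption that real text (punctuation, hyphenation) may not satisfy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value object for one word-level difference. */
final class WordDiff {
    enum Kind { ADD, DELETE }

    final Kind kind;     // whether the word was added or deleted
    final String word;   // the added or deleted word
    final int position;  // character offset where the word starts
                         // (assumed: in the second text for ADD, in the first for DELETE)

    WordDiff(Kind kind, String word, int position) {
        this.kind = kind;
        this.word = word;
        this.position = position;
    }

    @Override
    public String toString() {
        return kind + " \"" + word + "\" @" + position;
    }
}

/** Hypothetical helper: a word together with its starting character offset. */
final class Token {
    final String word;
    final int start;

    Token(String word, int start) {
        this.word = word;
        this.start = start;
    }

    /** Splits on whitespace (an assumed, simplistic definition of "word"). */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start()));
        }
        return tokens;
    }
}
```

Keeping the start offset on each token is what lets a later diff step report where an added or deleted word begins, which is the extra information wdiff's plain output lacks.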

1 Answer

3

There's Diff, Match, Patch, available in Java, and a demo is available; it seems to do word differences.

mdma
  • 1
    I tried it a lot and it is basically char-based. If you want human-readable output you have to set a very high timeout, the computation is really slow, and even then it is not word-based (I mean "house" and "wife" are found to differ only in "hous" and "wif") – Mycol May 24 '10 at 17:00
  • Did you see the section on post-processing cleanup? You may be able to add a post processor that aligns differences to words. Is it for English text? When you raise the level to words, the problem becomes more complex. Even just tokenizing the text accurately into words is some effort, and then you have the problem of disambiguating differences - changes can be interpreted in several ways - which one makes sense may depend upon your application. Dealing with blocks of text cut and pasted to a different place is in principle one operation, but detecting this can be difficult. – mdma May 24 '10 at 17:18
  • 1
    If you can map words to characters (e.g. ensure there are no more than 64k unique words), then you can parse the text yourself, map each word to a character and run character differencing on that. Of course, if the implementation of the Diff algorithm is such that you can easily replace the data types being compared, then you may be able to trivially implement word differencing by passing word objects as input rather than chars. I haven't seen the Diff API, so I can't say for sure. – mdma May 24 '10 at 17:20
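
The idea in the last comment, comparing words instead of chars, can be sketched without any library by running a textbook longest-common-subsequence (LCS) comparison over the word lists. This is not the algorithm Diff, Match, Patch uses, and the names `WordLevelDiff`, `Change`, and `diff` are hypothetical. Words are assumed to be whitespace-separated, and the table needs memory proportional to the product of the two word counts, so it only suits moderately sized texts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordLevelDiff {
    /** One added or deleted word and the character offset where it starts. */
    static final class Change {
        final boolean added;  // true = word appears only in the second text
        final String word;
        final int position;   // offset in the second text if added, else in the first

        Change(boolean added, String word, int position) {
            this.added = added;
            this.word = word;
            this.position = position;
        }

        @Override
        public String toString() {
            return (added ? "+" : "-") + word + "@" + position;
        }
    }

    // Collects whitespace-separated words and their start offsets.
    private static void tokenize(String text, List<String> words, List<Integer> starts) {
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            words.add(m.group());
            starts.add(m.start());
        }
    }

    static List<Change> diff(String first, String second) {
        List<String> wa = new ArrayList<>(), wb = new ArrayList<>();
        List<Integer> pa = new ArrayList<>(), pb = new ArrayList<>();
        tokenize(first, wa, pa);
        tokenize(second, wb, pb);

        // lcs[i][j] = LCS length of the word suffixes wa[i..] and wb[j..]
        int[][] lcs = new int[wa.size() + 1][wb.size() + 1];
        for (int i = wa.size() - 1; i >= 0; i--) {
            for (int j = wb.size() - 1; j >= 0; j--) {
                lcs[i][j] = wa.get(i).equals(wb.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both word lists; words outside the LCS become changes.
        List<Change> changes = new ArrayList<>();
        int i = 0, j = 0;
        while (i < wa.size() && j < wb.size()) {
            if (wa.get(i).equals(wb.get(j))) {
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                changes.add(new Change(false, wa.get(i), pa.get(i)));
                i++;
            } else {
                changes.add(new Change(true, wb.get(j), pb.get(j)));
                j++;
            }
        }
        for (; i < wa.size(); i++) changes.add(new Change(false, wa.get(i), pa.get(i)));
        for (; j < wb.size(); j++) changes.add(new Change(true, wb.get(j), pb.get(j)));
        return changes;
    }
}
```

With this sketch, `WordLevelDiff.diff("my house is big", "my wife is big")` yields two changes that print as `[-house@3, +wife@3]`, so "house" and "wife" are reported as whole words rather than as "hous" and "wif". The other route mentioned in the comment, assigning each distinct word a `char` and feeding the encoded strings to a character diff, would give the same word granularity as long as there are fewer distinct words than `char` values.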