
We have a requirement in our project to compare two texts (update1, update2) and come up with an algorithm that determines how many words and how many sentences have changed.

Are there any algorithms that I can use?

I am not even looking for code. If I know the algorithm, I can code it in Java.

– Oreo, java_mouse

7 Answers


Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.
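
As a concrete illustration of the LCS idea, here is a minimal Java sketch (names are made up for the example, not taken from any answer) that counts changed words as the words falling outside the longest common subsequence of the two texts; running the same routine over arrays of sentences instead of words gives the sentence-level count.

    /**
     * Minimal LCS-based sketch: words that are not on the longest common
     * subsequence of the two texts are counted as removed or added.
     */
    public class WordChangeCount {

        // Classic dynamic-programming LCS length over word arrays.
        static int lcsLength(String[] a, String[] b) {
            int[][] dp = new int[a.length + 1][b.length + 1];
            for (int i = 1; i <= a.length; i++)
                for (int j = 1; j <= b.length; j++)
                    dp[i][j] = a[i - 1].equals(b[j - 1])
                            ? dp[i - 1][j - 1] + 1
                            : Math.max(dp[i - 1][j], dp[i][j - 1]);
            return dp[a.length][b.length];
        }

        public static void main(String[] args) {
            String[] update1 = "the men are bad".split("\\s+");
            String[] update2 = "the men are very bad".split("\\s+");
            int lcs = lcsLength(update1, update2);
            // Words outside the LCS were either removed from update1 or added in update2.
            System.out.println("changed words: "
                    + ((update1.length - lcs) + (update2.length - lcs)));   // prints 1
        }
    }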

– FatalError

An O(NP) Sequence Comparison Algorithm (Wu, Manber, Myers, and Miller) is used by Subversion's diff engine.

For your information, I have written implementations in various programming languages; they are available on the following GitHub page:

https://github.com/cubicdaiya/onp
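
To give a rough idea of how that algorithm works, here is a sketch in Java of the O(NP) edit-distance core applied to word sequences. It is an illustrative re-implementation under my own naming, not code from the linked repository, and it only returns the edit distance (insertions plus deletions).

    import java.util.Arrays;

    /** Illustrative sketch of the Wu-Manber-Myers O(NP) edit-distance core over word arrays. */
    public class OnpDistanceSketch {

        /** Returns the number of insertions plus deletions between two word arrays. */
        static int editDistance(String[] a, String[] b) {
            // The algorithm requires the shorter sequence first.
            if (a.length > b.length) { String[] t = a; a = b; b = t; }
            int m = a.length, n = b.length, delta = n - m, offset = m + 1;
            int[] fp = new int[m + n + 3];     // furthest y reached on each diagonal k
            Arrays.fill(fp, -1);
            int p = -1;
            do {
                p++;
                for (int k = -p; k <= delta - 1; k++)
                    fp[k + offset] = snake(a, b, k,
                            Math.max(fp[k - 1 + offset] + 1, fp[k + 1 + offset]));
                for (int k = delta + p; k >= delta + 1; k--)
                    fp[k + offset] = snake(a, b, k,
                            Math.max(fp[k - 1 + offset] + 1, fp[k + 1 + offset]));
                fp[delta + offset] = snake(a, b, delta,
                        Math.max(fp[delta - 1 + offset] + 1, fp[delta + 1 + offset]));
            } while (fp[delta + offset] != n);
            return delta + 2 * p;              // D = delta + 2P
        }

        /** Follows diagonal k as far as the two sequences keep matching. */
        private static int snake(String[] a, String[] b, int k, int y) {
            int x = y - k;
            while (x < a.length && y < b.length && a[x].equals(b[y])) { x++; y++; }
            return y;
        }

        public static void main(String[] args) {
            String[] u1 = "the men are bad".split("\\s+");
            String[] u2 = "the men are very bad".split("\\s+");
            System.out.println(editDistance(u1, u2));   // prints 1 (one inserted word)
        }
    }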

– cubicdaiya

Some kind of diff variant might be helpful, e.g. wdiff.

If you decide to devise your own algorithm, you're going to have to address the situation where a sentence has been inserted. For example, consider the following two documents:

The men are bad. I hate the men

and

The men are bad. John likes the men. I hate the men

Your tool should be able to look ahead and recognise that in the second document, "I hate the men" has not been replaced by "John likes the men" but is untouched, with a new sentence inserted before it. That is, it should report the insertion of one sentence, not the change of four words followed by a new sentence.
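
As a concrete illustration of that point (illustrative code only, with a deliberately naive sentence splitter), the following Java sketch compares the two example documents at sentence granularity with an LCS walk and reports the middle sentence as a single insertion:

    /** Sentence-level comparison so an inserted sentence is reported as one insertion. */
    public class SentenceInsertionDemo {

        public static void main(String[] args) {
            // Naive splitter: break after a full stop followed by whitespace.
            String[] a = "The men are bad. I hate the men".split("(?<=\\.)\\s+");
            String[] b = "The men are bad. John likes the men. I hate the men".split("(?<=\\.)\\s+");

            // Suffix LCS table over whole sentences.
            int n = a.length, m = b.length;
            int[][] lcs = new int[n + 1][m + 1];
            for (int i = n - 1; i >= 0; i--)
                for (int j = m - 1; j >= 0; j--)
                    lcs[i][j] = a[i].equals(b[j]) ? lcs[i + 1][j + 1] + 1
                                                  : Math.max(lcs[i + 1][j], lcs[i][j + 1]);

            // Walk the table: a sentence present only in 'b' is an insertion.
            int i = 0, j = 0;
            while (i < n && j < m) {
                if (a[i].equals(b[j]))                   { System.out.println("kept:     " + a[i]); i++; j++; }
                else if (lcs[i + 1][j] >= lcs[i][j + 1]) { System.out.println("removed:  " + a[i]); i++; }
                else                                     { System.out.println("inserted: " + b[j]); j++; }
            }
            while (i < n) System.out.println("removed:  " + a[i++]);
            while (j < m) System.out.println("inserted: " + b[j++]);
            // Output: "John likes the men." is reported as a single inserted sentence.
        }
    }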

– Howard

Here are two papers that describe other text comparison algorithms that should generally output 'better' (e.g. smaller, more meaningful) differences:

The first paper cites the second and mentions this about its algorithm:

Heckel[3] pointed out similar problems with LCS techniques and proposed a linear-time algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.

The first paper was mentioned in one answer and the second in another answer to a similar SO question.
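
To illustrate the behaviour that quote describes, here is a rough, simplified Java sketch of the unique-symbol anchoring idea that Heckel-style comparison is based on (it is not the full procedure from the paper): tokens that occur exactly once in each text become anchors, and matched regions are grown around them. With aabb versus bbaa no token is unique, so no anchor is found and everything is reported as changed.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    /** Simplified sketch of unique-token anchoring; not the full Heckel algorithm. */
    public class UniqueAnchorSketch {

        /** Returns newToOld[i] = index in 'oldT' matched to newT[i], or -1 if unmatched. */
        static int[] match(String[] oldT, String[] newT) {
            Map<String, Integer> oldCount = new HashMap<>(), newCount = new HashMap<>();
            Map<String, Integer> oldPos = new HashMap<>();
            for (int j = 0; j < oldT.length; j++) {
                oldCount.merge(oldT[j], 1, Integer::sum);
                oldPos.put(oldT[j], j);
            }
            for (String t : newT) newCount.merge(t, 1, Integer::sum);

            int[] newToOld = new int[newT.length];
            int[] oldToNew = new int[oldT.length];
            Arrays.fill(newToOld, -1);
            Arrays.fill(oldToNew, -1);

            // Pass 1: anchor tokens that are unique in both sequences.
            for (int i = 0; i < newT.length; i++) {
                String t = newT[i];
                if (newCount.get(t) == 1 && Integer.valueOf(1).equals(oldCount.get(t))) {
                    int j = oldPos.get(t);
                    newToOld[i] = j;
                    oldToNew[j] = i;
                }
            }
            // Pass 2: extend matches forward from each anchor.
            for (int i = 0; i + 1 < newT.length; i++) {
                int j = newToOld[i];
                if (j >= 0 && j + 1 < oldT.length && newToOld[i + 1] < 0 && oldToNew[j + 1] < 0
                        && newT[i + 1].equals(oldT[j + 1])) {
                    newToOld[i + 1] = j + 1;
                    oldToNew[j + 1] = i + 1;
                }
            }
            // Pass 3: extend matches backward from each anchor.
            for (int i = newT.length - 1; i > 0; i--) {
                int j = newToOld[i];
                if (j > 0 && newToOld[i - 1] < 0 && oldToNew[j - 1] < 0
                        && newT[i - 1].equals(oldT[j - 1])) {
                    newToOld[i - 1] = j - 1;
                    oldToNew[j - 1] = i - 1;
                }
            }
            return newToOld;
        }

        public static void main(String[] args) {
            // "aabb" vs "bbaa": no symbol is unique, so no anchor is found.
            System.out.println(Arrays.toString(match(
                    new String[]{"a", "a", "b", "b"},
                    new String[]{"b", "b", "a", "a"})));   // prints [-1, -1, -1, -1]
        }
    }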

– Kenny Evitt

The specific algorithm used by diff and most other comparison utilities is Eugene Myers' An O(ND) Difference Algorithm and Its Variations. There's a Java implementation of it available in the java-diff-utils package.
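
Here is a small usage sketch, assuming a recent 4.x release of the io.github.java-diff-utils fork (older java-diff-utils versions use different package names and declare a checked DiffException). It diffs the two updates word by word and counts the changed words per delta:

    import java.util.Arrays;
    import java.util.List;

    import com.github.difflib.DiffUtils;
    import com.github.difflib.patch.AbstractDelta;
    import com.github.difflib.patch.Patch;

    /** Word-level diff via java-diff-utils (io.github.java-diff-utils:java-diff-utils). */
    public class JavaDiffUtilsExample {
        public static void main(String[] args) {
            List<String> update1 = Arrays.asList("the men are bad".split("\\s+"));
            List<String> update2 = Arrays.asList("the men are very bad".split("\\s+"));

            Patch<String> patch = DiffUtils.diff(update1, update2);

            int changedWords = 0;
            for (AbstractDelta<String> delta : patch.getDeltas()) {
                // Each delta pairs a source chunk (from update1) with a target chunk (from update2).
                changedWords += delta.getSource().getLines().size()
                              + delta.getTarget().getLines().size();
                System.out.println(delta.getType() + ": " + delta.getSource().getLines()
                        + " -> " + delta.getTarget().getLines());
            }
            System.out.println("changed words: " + changedWords);   // prints 1 (one inserted word)
        }
    }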

– Zoë Peterson

The difficulty comes when comparing large files efficiently and with good performance. I therefore implemented a variation of Myers' O(ND) diff algorithm, which performs well and gives accurate results (and supports filtering based on regular expressions):

The algorithm can be tested here: becke.ch compare tool web application

There is a little more information on the home page: becke.ch compare tool

– becke.ch
Amazing tool! But the download page is not available (404). Could you please direct me to where I can download it? – Sam Sirry, Apr 26 '20 at 21:17

The best-known algorithm is the O(ND) difference algorithm, which is also used by the Notepad++ Compare plugin (written in C++) and by GNU diff(1). You can find a C# implementation here: http://www.mathertel.de/Diff/default.aspx
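
For reference, here is a minimal Java sketch (not the C# code from that link) of the greedy core of Myers' O(ND) algorithm; it only returns the length of the shortest edit script, i.e. the number of inserted plus deleted words:

    /** Greedy core of Myers' O(ND) algorithm over word arrays. */
    public class MyersGreedySketch {

        static int shortestEditScriptLength(String[] a, String[] b) {
            int n = a.length, m = b.length, max = n + m;
            if (max == 0) return 0;
            int[] v = new int[2 * max + 1];       // furthest x reached on each diagonal k
            for (int d = 0; d <= max; d++) {
                for (int k = -d; k <= d; k += 2) {
                    int x;
                    if (k == -d || (k != d && v[max + k - 1] < v[max + k + 1])) {
                        x = v[max + k + 1];       // step down  (take a word from b: insertion)
                    } else {
                        x = v[max + k - 1] + 1;   // step right (drop a word from a: deletion)
                    }
                    int y = x - k;
                    while (x < n && y < m && a[x].equals(b[y])) { x++; y++; }   // follow the snake
                    v[max + k] = x;
                    if (x >= n && y >= m) return d;   // both sequences consumed with d edits
                }
            }
            return max;   // unreachable, but keeps the compiler happy
        }

        public static void main(String[] args) {
            String[] u1 = "the men are bad".split("\\s+");
            String[] u2 = "the men are very bad".split("\\s+");
            System.out.println(shortestEditScriptLength(u1, u2));   // prints 1 (one inserted word)
        }
    }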

– ozanmut