20

I need to match up two almost-the-same long freetext strings; i.e., to find index-to-index correspondences wherever possible.

Because this is freetext, the comparison should not be line-based as in code diffing.

Any suggestions for Java libraries?

A simple example (In real life , of course, there would not be extra whitespace to line things up, and there may be more complex challenges like entire clauses moved around.)

The quick brown  fox jumped over the  lazy     dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.
Joshua Fox
  • 18,704
  • 23
  • 87
  • 147
  • Well one problem you've got is that text diff tools/libraries typically operate on a line-by-line basis, meaning they only delineate between lines being same or different and if they're different is it because other lines have been inserted/removed? – cletus Jan 26 '09 at 12:38
  • Are the multiple space characters just for show or do you really only want to find correspondences if common subsequences start at the same index? If the latter, the problem is trivial - no need for a library, just write the code yourself! – Christoph Jan 26 '09 at 14:41
  • @Christoph "Are the multiple space characters just for show o" -- as mentioned above, these are just to help in illustrating the match. – Joshua Fox Apr 27 '11 at 19:41
  • 1
    See the pseudo-code of the **Needleman-Wunsch algorithm** ( [http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm](http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm)) used in bioinformatics to align two DNA sequence. – Pierre Jan 26 '09 at 13:22

4 Answers4

17

This one might be good Diff Match Patch.

Michael
  • 41,989
  • 11
  • 82
  • 128
Joshua Fox
  • 18,704
  • 23
  • 87
  • 147
8

Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, e.g.:

  • StringUtils#difference: Compares two Strings, and returns the portion where they differ
  • StringUtils#getLevenshteinDistance: Find the Levenshtein distance between two Strings
Fabian Steeg
  • 44,988
  • 7
  • 85
  • 112
1

Here's a (lightly-tested) version of code that does what you asked. You can easily traverse the result in parallel with the inputs to locate insertions and deletions.

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        int [][] t = new int [lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }

}

According to it, the actual longest subsequence of your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

(because "brown" and "yellow" have "ow" in common, etc.)

It's relatively straightforward to modify the above to split on whitespace (instead of into char arrays) and substitute String#equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above that change would produce the obvious result:

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'

(Your question implied character comparisons, as you matched the spaces between words.)

joel.neely
  • 30,725
  • 9
  • 56
  • 64
  • That is not what he asked for is it? It should be index to index based and thus not return a match for ow and yellow and brown since they are not on the same index? – willcodejavaforfood Jan 26 '09 at 13:36
0

If you're example is really what you want to do - ie subsequences only match if they start at the same index (which is different from how diffs normally operate) - this is all you need to do:

import java.util.*;

class StringDiff {
    public static List<int[]> from(String s1, String s2) {
        int start = -1;
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();

        for(; pos < s1.length() && pos < s2.length(); ++pos) {
            if(s1.charAt(pos) == s2.charAt(pos)) {
                if(start < 0) start = pos;
            }
            else {
                if(start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }

        if(start >= 0) list.add(new int[] { start, pos });

        return list;
    }

    public static void main(String[] args) {
        for(int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

An actual diff implementation will be far more sophisticated.

Christoph
  • 164,997
  • 36
  • 182
  • 240