4

I need to do Word by word comparison of two strings. Something like diff, but for words, not for lines.

Like it is done in wikipedia http://en.wikipedia.org/w/index.php?title=Horapollo&action=historysubmit&diff=21895647&oldid=21893459

In result I want return the two arrays of indexes of words, which are different in two string.

Are there any libraries/frameworks/standalone_methods for .NET which can do this?

P.S. I want to compare several kilobytes of text

Alex Blokha
  • 1,141
  • 2
  • 17
  • 30
  • Duplicate of http://stackoverflow.com/questions/473522/word-comparison-algorithm – Josh Stodola Nov 23 '09 at 22:20
  • 2
    First, break up the strings into two arrays of words. Then it's quite straightforward to find the strings that are _the same_ in two arrays. And if you can do that, then surely you can find the words that are different. Here's a simple example in JScript; turning it into C# only takes a few minutes. http://beta.blogs.msdn.com/ericlippert/archive/2004/07/21/recursion-and-dynamic-programming.aspx – Eric Lippert Nov 23 '09 at 23:12

7 Answers7

4

Actually, you probably want to implement a variation of the Local Alignment/Global Alignment algorithms we use in DNA sequence alignments. This is because you probably cannot do a word-by-word comparison of the two strings. I.e:

The quick brown fox jumps over the lazy dog
The quick fox jumps over the lazy dog

In other words, if you cannot identify insertions and deletions of whole words, your comparison algorithm can become very sc(r)ewed. Take a look at the Smith-Waterman algorithm and the Needleman-Wunsch algorithm and find a way to adapt them to your needs. Since such a search space can become very large if the strings are long, you could also check out BLAST. BLAST is a very common heuristic algorithm, and is pretty much the standard in genetic searches.

Pedery
  • 3,632
  • 1
  • 27
  • 39
  • I didn't get, why I cann't do word-by-word comparison of the two strings? What I want is like you said - identify insertions and deletions of whole words. – Alex Blokha Nov 12 '10 at 14:19
  • Because if you compare word by word, your comparison algorithm can quickly become very complicated. The above example is trivial, but illustrates the point The sequence algorithms I proposed were designed to identify gaps and insertions in comparable sequences. PS: don't forget to reward answers you find helpful. After all, that's how this community stays alive. Click the up arrow images next to useful answers. – Pedery Nov 20 '10 at 02:56
3

It seems I've found needed solution:

DiffPlex is a combination of a .NET Diffing Library with both a Silverlight and HTML diff viewer. http://diffplex.codeplex.com/

But It has one bug. In those lines "Hello-Kitty" "Hello - Kitty", the word "Hello" will be marked as difference. Although the difference is space symbol.

Alex Blokha
  • 1,141
  • 2
  • 17
  • 30
2

Use RegularExpressions.

Like in the example:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Collections.Specialized;

namespace WindowsApplication10
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button2_Click(object sender, EventArgs e)
        {
            decimal discrimation = 0.75M;
            string formHeading = "The brown dog jumped over the red lazy river, and then took a little nap! Fun!";
            string userSearch = "The brown dog jumped over the red lazy river, and then took a little ";
            //string userSearch = "brown dog nap fun";
            decimal res = CompareText(formHeading, userSearch);

            if (res >= discrimation)
            {
                MessageBox.Show("MATCH!" + res.ToString());
            }
            else 
            {
                MessageBox.Show("does not match! " + res.ToString());
            }
        }


        /// <summary>
        /// Returns a percentage of 1 on how many words were matched
        /// </summary>
        /// <returns></returns>
        private decimal CompareText(string formHeading, string userSearch)
        {
            StringCollection formHeadingWords = new StringCollection();
            StringCollection userSearchWords = new StringCollection();
            formHeadingWords.AddRange(System.Text.RegularExpressions.Regex.Split(formHeading, @"\W"));
            userSearchWords.AddRange(System.Text.RegularExpressions.Regex.Split(userSearch, @"\W"));

            int wordsFound = 0;
            for (int i1 = 0; i1 < userSearchWords.Count; i1++)
            {
                if (formHeadingWords.Contains(userSearchWords[i1]))
                    wordsFound += 1;
            }
            return (Convert.ToDecimal(wordsFound) / Convert.ToDecimal(formHeadingWords.Count));
        }
    }
}
serhio
  • 28,010
  • 62
  • 221
  • 374
1

you can replace all the words in your 2 texts with unique numbers, take some ready made code for Edit distance computation and replace it's character to character comparison with number to number comparison and you are done!

I am not sure if there exists any library for exactly what u want. But you will surely find lots of code for edit distance.

Further, depending on whether you want to actually want to allow substitutions or not in the edit distance computation, you can change the conditions in the dynamic programming code.

See this. http://en.wikipedia.org/wiki/Levenshtein_distance

user855
  • 19,048
  • 38
  • 98
  • 162
  • Actually I already wrote comparizon routine, but I don't like how it works, because new bugs appear from time to time, but I don't hae much time to fight, because this is the little peace of all functionality. Thats why I've looked for already wroten good tested thing. Its funny, but it seems such thing does not exist :) – Alex Blokha Nov 24 '09 at 00:57
  • @Alex: See my answer above :) – Pedery Nov 08 '10 at 12:16
1

You might try this, though I am not sure it's what you are looking for StringUtils.difference() (http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html#difference%28java.lang.String,%20java.lang.String%29)

Alternately, the Eclipse (eclipse.org) project has a diff comparison feature, which means they must also have code to determine the differences, you might browse through their API or source to see what you can find.

Good luck.

cjstehno
  • 13,468
  • 4
  • 44
  • 56
0

It seems I will use/port algorithm used here

http://www.google.com/codesearch/p?hl=en&sa=N&cd=6&ct=rc#Jc4aufN53J8/src/main/net/killingar/WordDiff.java&q=worddiff

Alex Blokha
  • 1,141
  • 2
  • 17
  • 30
0

One more library for c# is diff-match-patch - http://code.google.com/p/google-diff-match-patch/.

The bad thing it finds difference in characters. The good thing, there is instruction what you have to add to diff the words.

Alex Blokha
  • 1,141
  • 2
  • 17
  • 30