5

I'm trying to get my head around a problem of identifying the best match of English words from a dictionary file to a given string.

For example ("lines" being a List of dictionary words):

string testStr = "cakeday";

for (int x= 0; x<= testStr.Length; x++)
{
  string test = testStr.Substring(x);

   if (test.Length > 0)
   {
      string test2 = testStr.Remove(counter);
      int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
      Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
    }
}

Gives the output:

Test: cakeday /   : 0
Test: akeday / c  : 1
Test: keday / ca  : 0
Test: eday / cak  : 0
Test: day / cake  : 2
Test: ay / caked  : 1
Test: y / cakeda  : 1

Obviously "day / cake" is the best fit for the string however if I were to introduce a 3rd word into the string e.g "cakedaynow" it doesnt work so well.

I know the example is primitive, its more a proof of concept and was wondering if anyone had any experience with this type of string analysis?

Thanks!

chrr
  • 320
  • 1
  • 4
  • 15

2 Answers2

2

You'll want to research the class of algorithms appropriate to what you're trying to do. Start with Approximate string matching on Wikipedia.

Also, here's a Levenshtein Edit Distance implementation in C# to get you started:

using System;

namespace StringMatching
{
    /// <summary>
    /// A class to extend the string type with a method to get Levenshtein Edit Distance.
    /// </summary>
    public static class LevenshteinDistanceStringExtension
    {
        /// <summary>
        /// Get the Levenshtein Edit Distance.
        /// </summary>
        /// <param name="strA">The current string.</param>
        /// <param name="strB">The string to determine the distance from.</param>
        /// <returns>The Levenshtein Edit Distance.</returns>
        public static int GetLevenshteinDistance(this string strA, string strB)
        {
            if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
                return 0;

            if (string.IsNullOrEmpty(strA))
                return strB.Length;

            if (string.IsNullOrEmpty(strB))
                return strA.Length;

            int[,] deltas; // matrix
            int lengthA;
            int lengthB;
            int indexA;
            int indexB;
            char charA;
            char charB;
            int cost; // cost

            // Step 1
            lengthA = strA.Length;
            lengthB = strB.Length;

            deltas = new int[lengthA + 1, lengthB + 1];

            // Step 2
            for (indexA = 0; indexA <= lengthA; indexA++)
            {
                deltas[indexA, 0] = indexA;
            }

            for (indexB = 0; indexB <= lengthB; indexB++)
            {
                deltas[0, indexB] = indexB;
            }

            // Step 3
            for (indexA = 1; indexA <= lengthA; indexA++)
            {
                charA = strA[indexA - 1];

                // Step 4
                for (indexB = 1; indexB <= lengthB; indexB++)
                {
                    charB = strB[indexB - 1];

                    // Step 5
                    if (charA == charB)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }

                    // Step 6
                    deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
                }
            }

            // Step 7
            return deltas[lengthA, lengthB];
        }
    }
}
JamieSee
  • 12,696
  • 2
  • 31
  • 47
1

Why not:

Check all the strings inside the search word extracting from current search position to all possible lengths of the string and extract all discovered words. E.g.:

var list = new List<string>{"the", "me", "cat", "at", "theme"};
const string testStr = "themecat";
var words = new List<string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
    for(int i = (len - 1); i > x; i--)
    {
        string test = testStr.Substring(x, i - x + 1);
        if (list.Contains(test) && !words.Contains(test))
        {
            words.Add(test);
        }
    }
}

words.ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values

Output:

theme, the, me, cat, at

Edit

Live Scenario 1: For instance let's say you want to always choose the longest word in a jumbled sentence, you could read from front forward, reducing the amount of text read till you are through. Using a dictionary makes it much easier, by storing the indexes of the discovered words, we can quickly check to see if we have stored a word containing another word we are evaluating before.

Example:

var list = new List<string>{"the", "me", "cat", "at", "theme", "crying", "them"};
const string testStr = "themecatcryingthem";
var words = new Dictionary<int, string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
    int n = len > 28 ? 28 : len;//assuming 28 is the maximum length of an english word
    for(int i = (n - 1); i > x; i--)
    {
        string test = testStr.Substring(x, i - x + 1);
        if (list.Contains(test))
        {
            if (!words.ContainsValue(test))
            {
                bool found = false;//to check if there's a shorter item starting from same index
                var key = testStr.IndexOf(test, x, len - x);
                foreach (var w in words)
                {
                    if (w.Value.Contains(test) && w.Key != key && key == (w.Key + w.Value.Length - test.Length))
                    {
                        found = true;
                    }
                }
                if (!found && !words.ContainsKey(key)) words.Add(key, test);
            }
        }
    }
}

words.Values.ToList().ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values

Output:

theme, cat, crying, them

Chibueze Opata
  • 9,856
  • 7
  • 42
  • 65
  • I tried your code but unfortunately it didn't give me the results I was hoping for. I should add that the entire string must be matched in as few words as possible, so for ex "themecat" should output to "theme cat" rather than "the me cat" as its the shortest path so to speak. – chrr Jul 30 '12 at 18:24
  • Actually I think i've found a solution on another SO question, now to convert it to C# :) http://stackoverflow.com/a/195024/243935 – chrr Jul 30 '12 at 18:28
  • @chrr: The second approach in my code addresses that. I have done a lot of projects on AI and string manipulation so this sort of thing is not new or complex to me at all. – Chibueze Opata Jul 30 '12 at 19:02
  • In lay man terms, you can just fetch "theme cat" correctly from "themecat" using the above by iterating through each word in the list and removing any other string they contain that are words in the list. I can update the code for you with this if you want. – Chibueze Opata Jul 30 '12 at 19:17
  • Done, but written in a bit of a hurry. I'd like to see the performance results if you get to use it for a large chunk of data. You can check [wordnet](http://wordnet.princeton.edu/wordnet/related-projects/#.NET) for some dictionary collections. – Chibueze Opata Aug 02 '12 at 01:30