
I have a collection of words stored in a List object, for example this collection of titles:

Lorem Ipsum
Centuries
Electronic

And this is a sample paragraph in which I want to look for these words:
lorem ipsum is simply dummy text of the printing and typesetting industry. Loren Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing LorenIpsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of LoremIpsum.

My goal is to extract those words from the paragraph even when they are misspelled, because the end goal is to correct their capitalization and spelling.

My expected result here is:

lorem ipsum
Loren Ipsum
centuries
electronic
LorenIpsum
LoremIpsum

But it is not limited to these, because this will run over entire articles, and there are hundreds of articles.

Sorry, I don't have any code written yet, but I was planning to use Regex in C# for this.

Jayson Ragasa
  • What do you mean by not limited to these? – hwnd Nov 24 '14 at 04:33
  • Recommended reading: http://norvig.com/spell-correct.htm – Blorgbeard Nov 24 '14 at 04:35
  • Implementing spellchecking is possibly a bit too broad, but you can start here: http://stackoverflow.com/questions/2344320/comparing-strings-with-tolerance – Alexei Levenkov Nov 24 '14 at 04:35
  • or look into [soundex](http://seesharpdeveloper.blogspot.com.au/2013/07/soundex-algorithm-in-c.html) and [levenshtein](http://www.techrepublic.com/blog/software-engineer/how-do-i-implement-the-soundex-function-in-c/) and the similar for c# – gwillie Nov 25 '14 at 02:59

1 Answer


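Many algorithms exist for measuring the similarity between two words. One of them is edit distance: the GetEdits function below computes the Levenshtein distance, i.e. the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.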

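The following code can be used, although it may not be very efficient: every token is compared against every search word, and each comparison fills a full (m + 1) × (n + 1) table.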

using System;

class Program
{
    // Levenshtein edit distance between two strings, ignoring case.
    static int GetEdits(string answer, string guess)
    {
        guess = guess.ToLowerInvariant();
        answer = answer.ToLowerInvariant();

        // d[i, j] = edits needed to turn the first i chars of answer
        // into the first j chars of guess.
        int[,] d = new int[answer.Length + 1, guess.Length + 1];
        for (int i = 0; i <= answer.Length; i++)
            d[i, 0] = i;
        for (int j = 0; j <= guess.Length; j++)
            d[0, j] = j;
        for (int j = 1; j <= guess.Length; j++)
            for (int i = 1; i <= answer.Length; i++)
                if (answer[i - 1] == guess[j - 1])
                    d[i, j] = d[i - 1, j - 1];  // no operation
                else
                    d[i, j] = Math.Min(Math.Min(
                        d[i - 1, j] + 1,        // a deletion
                        d[i, j - 1] + 1),       // an insertion
                        d[i - 1, j - 1] + 1);   // a substitution
        return d[answer.Length, guess.Length];
    }

    static void Main(string[] args)
    {
        const string text = @"lorem ipsum is simply dummy text of the printing and typesetting industry. Loren Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing LorenIpsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of LoremIpsum.";

        var findWords = new string[]
        {
            "Lorem Ipsum",
            "Centuries",
            "Electronic"
        };

        // Maximum edit distance at which a token still counts as a match.
        const int MaxErrors = 2;

        // Tokenize the text, dropping the empty entries left by ". " and ", ".
        var tokens = text.Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < tokens.Length; i++)
        {
            foreach (var findWord in findWords)
            {
                if (GetEdits(findWord, tokens[i]) <= MaxErrors)
                {
                    Console.WriteLine(tokens[i]);
                    break;
                }

                // For multi-word entries, join with the next token and check again.
                if (findWord.Contains(" ") && i + 1 < tokens.Length)
                {
                    string token = tokens[i] + " " + tokens[i + 1];
                    if (GetEdits(findWord, token) <= MaxErrors)
                    {
                        Console.WriteLine(token);
                        i++; // the next token has already been consumed
                        break;
                    }
                }
            }
        }
    }
}
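
The code above only prints the matches. Since the stated goal is to correct capitalization and spelling, one possible next step is to rebuild the text with each match replaced by the canonical entry from the list. The following is only a minimal sketch of that idea, not a tested solution: TitleCorrector, Correct, and Distance are hypothetical names introduced here, the distance function keeps just two rows of the table to save memory, and Regex.Split with a capturing group is used so that the original punctuation and spacing can be put back unchanged.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class TitleCorrector
{
    // Case-insensitive Levenshtein distance using two rolling rows.
    static int Distance(string a, string b)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Returns text with every near match replaced by its entry in titles.
    public static string Correct(string text, string[] titles, int maxErrors)
    {
        // The capturing group keeps the separators in the result:
        // even indexes hold words, odd indexes hold separators.
        string[] parts = Regex.Split(text, @"(\W+)");
        var result = new StringBuilder();

        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 1 || parts[i].Length == 0)
            {
                result.Append(parts[i]); // separator or empty edge entry
                continue;
            }

            string replacement = parts[i];
            foreach (string title in titles)
            {
                if (Distance(title, parts[i]) <= maxErrors)
                {
                    replacement = title;
                    break;
                }

                // Two-word entries: try the current word plus the next one,
                // but only when a single space separates them.
                bool hasNext = i + 2 < parts.Length && parts[i + 1] == " ";
                if (title.Contains(" ") && hasNext &&
                    Distance(title, parts[i] + " " + parts[i + 2]) <= maxErrors)
                {
                    replacement = title;
                    i += 2; // skip the separator and the second word
                    break;
                }
            }
            result.Append(replacement);
        }
        return result.ToString();
    }
}
```

With the list from the question and a limit of 2, TitleCorrector.Correct("five centuries of LorenIpsum.", findWords, 2) should return "five Centuries of Lorem Ipsum.". A fixed limit of 2 is an assumption carried over from the code above; for short entries it would produce false positives, so a limit proportional to the entry length may work better.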
Usman Zafar