Assuming no word in the dictionary contains another (the way "TOOK" contains "TOO"), I don't see why this problem needs anything more complicated than this one-line function:
    public static List<string> Normalize(string input, List<string> dictionary)
    {
        return dictionary.Where(a => input.Contains(a)).ToList();
    }
(If the words DO contain each other, see below.)
Full example:
    using System;
    using System.Linq;
    using System.Collections.Generic;

    public class Program
    {
        public static List<string> Normalize(string input, List<string> dictionary)
        {
            return dictionary.Where(a => input.Contains(a)).ToList();
        }

        public static void Main()
        {
            List<string> dictionary = new List<string>
            {
                "COMPUTER","FIVE","CODE","COLOR","FOO"
            };
            string input = "COMPUTERFIVECODECOLORBAR";
            var normalized = Normalize(input, dictionary);
            foreach (var s in normalized)
            {
                Console.WriteLine(s);
            }
        }
    }
Output:
COMPUTER
FIVE
CODE
COLOR
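The caveat about words containing each other can be seen with a small sketch. The dictionary and input here are made up for illustration; the function is the same one-liner. Because `Contains` matches anywhere in the string, "TOO" is reported even though the input only holds "TOOK":

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverlapDemo
{
    // Same one-liner as above: keeps every dictionary word found anywhere in the input.
    public static List<string> Normalize(string input, List<string> dictionary)
    {
        return dictionary.Where(a => input.Contains(a)).ToList();
    }

    public static void Main()
    {
        // Hypothetical dictionary in which "TOO" is a substring of "TOOK".
        var dictionary = new List<string> { "TOO", "TOOK", "FIVE" };

        // The input contains "TOOK" but no standalone "TOO".
        var result = Normalize("TOOKFIVE", dictionary);

        Console.WriteLine(string.Join(",", result)); // prints TOO,TOOK,FIVE
    }
}
```

Note also that the results come back in dictionary order, not in the order the words occur in the input.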
On the other hand, if you've determined that your keywords DO in fact overlap, you're not totally out of luck. If you are certain that the input string contains only words that are in the dictionary, and that they are contiguous (nothing else between them), you can use a more complicated function:
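    // Uses the same using directives as the full example above.
    public static List<string> Normalize2(string input, List<string> dictionary)
    {
        // Try longer keywords first so "FOOD" wins over "FOO" at the same position.
        // Empty entries are skipped; they would match forever without consuming input.
        var sorted = dictionary.Where(a => a.Length > 0).OrderByDescending(a => a.Length).ToList();
        var results = new List<string>();
        bool found;
        do
        {
            found = false;
            foreach (var s in sorted)
            {
                // Ordinal comparison avoids culture-sensitive matching.
                if (input.StartsWith(s, StringComparison.Ordinal))
                {
                    found = true;
                    results.Add(s);
                    input = input.Substring(s.Length); // drop the matched keyword
                    break;
                }
            }
        }
        while (input != "" && found);
        return results;
    }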
    public static void Main()
    {
        List<string> dictionary = new List<string>
        {
            "SHORT","LONG","LONGER","FOO","FOOD"
        };
        string input = "FOODSHORTLONGERFOO";
        var normalized = Normalize2(input, dictionary);
        foreach (var s in normalized)
        {
            Console.WriteLine(s);
        }
    }
This works by checking only the beginning of the string, trying the longest keywords first. When one matches, it is removed from the front of the input and the search repeats on what remains. The loop stops when the input is empty or when no keyword matches the start of the remainder.
Output:
FOOD
SHORT
LONGER
FOO
Notice that "LONG" is not included because we included "LONGER", but "FOO" is included because it is in the string separate from "FOOD".
Also, with this second solution, the keywords appear in the results list in the same order they appeared in the original string (the first function returns them in dictionary order). So if the requirement is to actually split the phrase rather than just detect the keywords in any order, you should use the second function.
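One limitation of taking the longest match first: it can commit to a keyword that leaves a remainder no keyword fits. With the made-up dictionary "FOOD", "FOO", "DOG" and the input "FOODOG", the greedy pass takes "FOOD" and then stops at "OG". If that matters for your data, a backtracking variant is one option. The following is only a sketch; `BacktrackingSplit`, `Split`, and `TrySplit` are hypothetical names, not part of the code above. It returns null when no complete split exists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BacktrackingSplit
{
    // Hypothetical helper: returns the keywords in input order,
    // or null if the input cannot be split entirely into dictionary words.
    public static List<string> Split(string input, List<string> dictionary)
    {
        // Longest first, so the result equals the greedy one whenever
        // the greedy pass would have consumed the whole input.
        var sorted = dictionary.Where(w => w.Length > 0)
                               .OrderByDescending(w => w.Length)
                               .ToList();
        var results = new List<string>();
        return TrySplit(input, sorted, results) ? results : null;
    }

    private static bool TrySplit(string remaining, List<string> sorted, List<string> results)
    {
        if (remaining.Length == 0) return true; // everything consumed

        foreach (var word in sorted)
        {
            if (!remaining.StartsWith(word, StringComparison.Ordinal)) continue;

            results.Add(word);
            if (TrySplit(remaining.Substring(word.Length), sorted, results)) return true;

            // Dead end: undo this choice and try the next (shorter) keyword.
            results.RemoveAt(results.Count - 1);
        }
        return false;
    }

    public static void Main()
    {
        var dictionary = new List<string> { "FOOD", "FOO", "DOG" };
        var result = Split("FOODOG", dictionary);
        Console.WriteLine(result == null ? "no split" : string.Join(",", result)); // prints FOO,DOG
    }
}
```

Without memoizing failed positions this can take exponential time on adversarial input, so treat it as a starting point rather than a finished implementation.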