In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather" using C#. I have created a working solution, a recursive function using exponential time, which is simply not efficient enough for my purposes (processing at least over 100 joined words). Here the questions I have read so far, which I believe may be helpful, but I cannot translate their responses from Python to C#.
- How can I split multiple joined words?
- Need help understanding this Python Viterbi algorithm
- How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output and displays. I am currently defining the most probable output as the one which uses the longest individual words - I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
static public string find_words(string instring)
{
if (words.Contains(instring)) //where words is my dictionary of words
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
string bestSolution = "";
string solution = "";
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
//if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
private void btnAutoSpacing_Click(object sender, EventArgs e)
{
string text = txtText.Text;
text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
if (text.Length > 150)
{
//possibly split the text up into more manageable chunks?
//considering using textSplit() for this.
}
else
{
txtText.Text = find_words(text);
}
}
static IEnumerable<string> textSplit(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
static public string find_words(string instring)
{
string bestSolution = "";
string solution = "";
if (compiledWords[instring[0]].Contains(instring))
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
- the
- be
- and
- ...
- attorney
Here is a link to a small example of such a text file I would like to use. Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution. I also remain open to the idea that perhaps relative frequencies with the Viterbi algorithm isn't the best way to split words, any other suggestions for creating a solution using C#?