The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probablity() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the next.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionally low scores. Emphasing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.