
The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as Tuples. Each tuple can then be assigned a probability based on a calculation. I am using this as part of a Monte Carlo / genetic algorithm to train the program to recognize a language based on syntax alone (just the character transitions).

I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method, _Probability(), is at the end of the class.

This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method

Ultimately I'd like to store these values in a 4D matrix. But given that there are 26 letters in the alphabet, that would be a HUGE task (26x26x26x26). If I take only the first 15,000 characters of the novel then performance improves a ton, but my data isn't as useful.

Here is the method that parses the text 'source':

    private List<Tuple<char, char, char, char>> _Parse(string src)
    {
        var _map = new List<Tuple<char, char, char, char>>();

        // Slide a 4-character window across the text, adding one tuple per position.
        for (int i = 0; i < src.Length - 3; i++)
        {
            int j = i + 1;
            int k = i + 2;
            int l = i + 3;

            _map.Add
              (new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
        }

        return _map;
    }

And here is the _Probability method:

    // "map" is the tuple list produced by _Parse; "e" is a constant defined elsewhere in the class.
    private double _Probability(char x0, char x1, char x2, char x3)
    {
        // Successively narrow the tuple list: x0, then x0-x1, then x0-x1-x2, then the full 4-tuple.
        var subset_x0 = map.Where(x => x.Item1 == x0);
        var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
        var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
        var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);

        int count_of_x0 = subset_x0.Count();
        int count_of_x1_following = subset_x0_x1_following.Count();
        int count_of_x2_following = subset_x0_x2_following.Count();
        int count_of_x3_following = subset_x0_x3_following.Count(); 

        decimal p1;
        decimal p2;
        decimal p3;

        if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
        {
            p1 = e;
            p2 = e;
            p3 = e;
        }
        else
        {
            p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
            p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
            p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;

            p1 = (p1 * 100) + e; 
            p2 = (p2 * 100) + e;
            p3 = (p3 * 100) + e; 
        }

        //more calculations omitted

        return _final; 
    }
}

EDIT - I'm providing more details to clear things up:

1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf

2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:

[t][h]

[t][h][a]

[t][h][a][t]

[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the text.

The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.

I came up with a Dictionary that simply tallies how often each tuple occurs in the text (similar to what Vilx- suggested), rather than storing identical tuples repeatedly, which is a waste of memory. That got me from about ~400ms per lookup down to about ~40ms, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
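
For reference, here is a rough sketch of that counting idea (the class and method names are just illustrative; the `e` smoothing and the weighting of the three scores are left out and stay as in `_Probability`):

    using System.Collections.Generic;

    class TransitionCounts
    {
        // Occurrence counts for every 1-, 2-, 3- and 4-character sequence in the text.
        private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

        public TransitionCounts(string src)
        {
            // Single pass over the text, tallying every n-gram for n = 1..4.
            for (int i = 0; i < src.Length; i++)
            {
                for (int n = 1; n <= 4 && i + n <= src.Length; n++)
                {
                    string key = src.Substring(i, n);
                    int c;
                    _counts.TryGetValue(key, out c);
                    _counts[key] = c + 1;
                }
            }
        }

        private int Count(string s)
        {
            int c;
            return _counts.TryGetValue(s, out c) ? c : 0;
        }

        // P(next | prefix), e.g. Conditional("tha", 't') = count("that") / count("tha").
        public double Conditional(string prefix, char next)
        {
            int prefixCount = Count(prefix);
            if (prefixCount == 0) return 0.0;
            return (double)Count(prefix + next) / prefixCount;
        }
    }

The three scores for "that" then come from Conditional("t", 'h'), Conditional("th", 'a') and Conditional("tha", 't'), each a constant-time dictionary lookup instead of a scan over the tuple list.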

Sean Thoman
  • 26^4 isn't so bad. It's only 456,976. Even if you stored each count as a 64-bit quantity, it's still just 3.5MB. Practically nothing for computers of today. Have you tried this approach? – Vilx- Sep 19 '11 at 07:12
  • Hmm, true, but the main obstacle I am encountering is that it takes about ~400ms to calculate and assign a value to each cell in the matrix. So if there are 456,976 values to assign, that's like ~50 hours. I think that's right? I can get the 400ms down to about 1ms, but only b/c I look at a much smaller sample, like the first 15,000 characters of the novel only. – Sean Thoman Sep 19 '11 at 07:15
  • I don't know the calculations required, but couldn't you just go through the text once, character-by-character, incrementing the corresponding cell on each char? Then in each cell you have the count of that particular 4-char combination. Afterwards you divide every cell by the total count of 4-char combinations (which will always be `total_number_of_chars_in_text - 3`). You can even avoid the division up front and just divide each time you need the probability for a specific combination. – Vilx- Sep 19 '11 at 07:32
  • You say: "But given that there are 26 letters in the alphabet...". You have to be careful there. At that point, you can be almost sure that your text is English =) – Jens Sep 19 '11 at 07:33
  • +1 to that. 26 only works for English texts. And you should be careful to avoid punctuation characters, spaces, newlines, etc. – Vilx- Sep 19 '11 at 07:36
  • If you're going to do anything so specific and performance critical, I'd write it in C++ and use P/Invoke or a C++/CLI wrapper. My $0.02 – sehe Sep 19 '11 at 07:44
  • Just a note but your 'speed problem' appears to have nothing to do with parsing. – H H Sep 19 '11 at 08:08
  • @sehe - It seems to me that the author has simply chosen the wrong approach and it can be brought down to a very good performance even in entirely managed code. – Vilx- Sep 19 '11 at 09:34
  • @Vilx: I wholly agree. However, I said what I'd do :) It seems to be functionality easily isolated so I'd do that. – sehe Sep 19 '11 at 09:45
  • @Vilx, your exact suggestion wouldn't work for me, but I am using something along those lines now. See my edit for details. – Sean Thoman Sep 19 '11 at 19:17
  • What wouldn't work, apart from the foreign alphabets? If you're worried about the tuples with less than 4 characters (like [t] and [t][h]), then just add a special 27th symbol - [blank]. Your [t] then becomes [blank][blank][blank][t]. This brings up the total elements in the 4D array to 531,441, which still isn't very much. Alternatively you can also make 1D, 2D and 3D arrays for shorter tuples. In any case, it will be just like the dictionary, except the access time is guaranteed O(1). This will probably make it twice as fast, at least. – Vilx- Sep 19 '11 at 20:30
  • @Vilx, I'm not sure if we're on the same page, but maybe I don't fully understand what you're suggesting. Given a tuple {x0,x1,x2,x3}, what I am calculating is essentially 1) Given the first character x0, what is the probability that the next character will be x1, 2) Given the first and second characters {x0, x1}, what is the probability that the third character will be x2, and finally 3) Given the first, second, and third characters {x0, x1, x2}, what is the probability that the fourth character will be x3. Of course there are many ways to process any n-tuple, but this is what I'm doing. – Sean Thoman Sep 19 '11 at 20:44
  • Just to add to the previous comment -- it's important (I think?) to leave steps 1), 2), and 3) separate because they give a different 'kind' of score to any given string. For instance, if you went through a string looking only at {x0, x1} pairs, you'd have a score based on first-order transitions only, whereas looking at higher-order transitions could push the scoring of that string in a different direction. I.e., a string could 'seem' like English on a first-order transitional basis but be complete gibberish, hence I think I should refine the score based on higher-order transitions. – Sean Thoman Sep 19 '11 at 20:54

5 Answers


I would suggest changing the data structure to make that faster...

I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient, since you would be accessing each "level" (Item1...Item4) as you calculate... and you could cache the result in the innermost Dictionary so that next time you don't have to calculate it at all.
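
Roughly, a memoised lookup over that nested structure might look like the sketch below (`ProbabilityCache` and `GetOrAdd` are illustrative names; the actual calculation is whatever your `_Probability` does today):

    using System;
    using System.Collections.Generic;

    class ProbabilityCache
    {
        // One dictionary level per tuple position; the innermost value holds the cached result.
        private readonly Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, double>>>> _cache =
            new Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, double>>>>();

        public double GetOrAdd(char x0, char x1, char x2, char x3, Func<double> compute)
        {
            Dictionary<char, Dictionary<char, Dictionary<char, double>>> d1;
            if (!_cache.TryGetValue(x0, out d1))
                _cache[x0] = d1 = new Dictionary<char, Dictionary<char, Dictionary<char, double>>>();

            Dictionary<char, Dictionary<char, double>> d2;
            if (!d1.TryGetValue(x1, out d2))
                d1[x1] = d2 = new Dictionary<char, Dictionary<char, double>>();

            Dictionary<char, double> d3;
            if (!d2.TryGetValue(x2, out d3))
                d2[x2] = d3 = new Dictionary<char, double>();

            double p;
            if (!d3.TryGetValue(x3, out p))
                d3[x3] = p = compute();   // compute once (e.g. by calling the existing _Probability), then reuse

            return p;
        }
    }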

Yahia
  • Instead of a 3 level deep nested dictionary, I would just keep the list as is, and make 4 flat `ILookup` object indexing by every tuple member. – Aviad P. Sep 19 '11 at 07:18
  • I'll try the dictionary. @Aviad, could you elaborate on the ILookup strategy? – Sean Thoman Sep 19 '11 at 07:21

In your probability method you iterate the map repeatedly: the Where calls are deferred, so each of the four Count calls re-enumerates the list and re-applies the filters. Adding a .ToList() at the end of each Where would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in builds the tentative distribution on insert. That way, when you're done inserting (which shouldn't be slowed down too much), you're done; or, as in the code below, you can make the probability cheap to calculate whenever you need it.

As an aside, you might want to take punctuation and whitespace into account. The first letter/word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did something like this some years back, and showed that using just three characters was almost as accurate as using more (we tested up to 7) -- we had no failures with three on our test data, and "almost" is only a hedge because there must be some weird text where the lack of information would yield an incorrect result -- but the speed of three letters made it the best choice.

EDIT

Here's an example of how I think I would do it in C#

    using System.Collections.Generic;

    class TextParser
    {
        // Builds a 4-level prefix tree: one level per character position in the window.
        public Node Parse(string src)
        {
            var top = new Node(null);

            for (int i = 0; i < src.Length - 3; i++)
            {
                var first = src[i];
                var second = src[i + 1];
                var third = src[i + 2];
                var fourth = src[i + 3];

                var firstLevelNode = top.AddChild(first);
                var secondLevelNode = firstLevelNode.AddChild(second);
                var thirdLevelNode = secondLevelNode.AddChild(third);
                thirdLevelNode.AddChild(fourth);
            }

            return top;
        }
    }

    public class Node
    {
        private readonly Node _parent;
        private readonly Dictionary<char, Node> _children
                         = new Dictionary<char, Node>();
        private int _count;

        public Node(Node parent)
        {
            _parent = parent;
        }

        // Returns the child node for 'value', creating it if needed, and bumps its count.
        public Node AddChild(char value)
        {
            if (!_children.ContainsKey(value))
            {
                _children.Add(value, new Node(this));
            }
            var levelNode = _children[value];
            levelNode._count++;
            return levelNode;
        }

        // Conditional probability of the last character of 'substring' given the
        // characters before it: count(substring) / count(prefix).
        public decimal Probability(string substring)
        {
            var node = this;
            foreach (var c in substring)
            {
                if (!node.Contains(c))
                    return 0m;
                node = node[c];
            }
            return ((decimal)node._count) / node._parent.TotalCount;
        }

        // How often this prefix occurred: the sum of its children's counts.
        private int TotalCount
        {
            get
            {
                var total = 0;
                foreach (var child in _children.Values)
                    total += child._count;
                return total;
            }
        }

        public Node this[char value]
        {
            get { return _children[value]; }
        }

        private bool Contains(char c)
        {
            return _children.ContainsKey(c);
        }
    }

The usage would then be:

    var top = new TextParser().Parse(src);
    var p = top.Probability("test");
Rune FS

Ok, I don't have time to work out details, but this really calls for

  • neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force

  • you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)

sehe

There's not much you can do with the parse function as it stands. However, the tuples are just four consecutive characters from a large body of text, so why not replace each tuple with an int and use that int to index into the original text when you need the character values? Your tuple-based method effectively consumes four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.

You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the LINQ statements you're using? The .Where will be doing memory allocation (which is a slow operation), and the LINQ statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
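
For instance, a direct scan of the source string (instead of LINQ over the tuple list) might look something like the sketch below; `CountPrefix` is an illustrative name, not part of the original code:

    // Count how many positions in the text start with the given character sequence.
    static int CountPrefix(string src, params char[] chars)
    {
        int count = 0;
        for (int i = 0; i + chars.Length <= src.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < chars.Length; j++)
            {
                if (src[i + j] != chars[j]) { match = false; break; }
            }
            if (match) count++;
        }
        return count;
    }

    // e.g. count_of_x1_following would become CountPrefix(src, x0, x1)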

But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:

    SortedDictionary<int,int> count_of_single_letters; // key = single character
    SortedDictionary<int,int> count_of_double_letters; // key = char1 + char2 * 32
    SortedDictionary<int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
    SortedDictionary<int,int> count_of_quad_letters;   // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
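
The `* 32` multipliers assume each character has already been reduced to a small index (say, 0-25 for a-z plus a few slots for space and punctuation). A possible key-packing helper, purely as a sketch (`PackKey` is not from the original code):

    // Pack up to four 5-bit character indices (each 0..31) into one int key,
    // matching the char1 + char2*32 + char3*32*32 + ... scheme above.
    static int PackKey(params int[] letterIndexes)
    {
        int key = 0;
        for (int i = 0; i < letterIndexes.Length; i++)
            key += letterIndexes[i] << (5 * i);   // i.e. * 32, * 1024, * 32768
        return key;
    }

    // e.g. the count_of_double_letters key for "th" would be PackKey('t' - 'a', 'h' - 'a')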

Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and the denominator, and do the division as late as possible.

Skizz

The best approach here is to use sparse storage and to prune, for example after every 10,000 characters. The best storage structure in this case is a prefix tree; it allows fast probability calculation, updating, and sparse storage. You can find more of the theory in this Javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html
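
As a rough illustration of the pruning idea (the flat count table, threshold, and chunk size here are placeholders, not LingPipe's API):

    using System.Collections.Generic;
    using System.Linq;

    // Drop n-grams whose count is still below a threshold, so the sparse table stays small.
    static void Prune(Dictionary<string, int> counts, int minCount)
    {
        var rare = counts.Where(kv => kv.Value < minCount)
                         .Select(kv => kv.Key)
                         .ToList();          // materialise the keys before mutating the dictionary
        foreach (var key in rare)
            counts.Remove(key);
    }

Calling something like `Prune(counts, 2)` after every 10,000 characters keeps only the n-grams that actually recur.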

yura