I am writing a recursive algorithm to build a finite state automaton by parsing a regular expression. The automaton iterates through the expression, pushing characters to a stack and operators to an "operator stack." When I encounter "(" (indicating a grouping operation), I push a "sub automaton" to the stack and pass the rest of the pattern to the sub automaton to parse. When that automaton encounters ")", it passes the rest of the string up to the parent automaton to finish parsing. Here is the code:
var NFA = function(par) {
    this.stack = [];
    this.op_stack = [];
    this.parent = par;
};
NFA.prototype.parse = function(pattern) {
    var done = false;
    for(var i in pattern) {
        if (done === true) {
            break;
        }
        switch(pattern.charAt(i)) {
            case "(":
                var sub_nfa = new NFA(this);
                this.stack.push(sub_nfa);
                sub_nfa.parse(pattern.substring(i+1, pattern.length));
                done = true;
                break;
            case ")":
                if (this.parent !== null) {
                    var len = pattern.length;
                    /*TROUBLE SPOT*/
                    this.parent.parse(pattern.substring(i, pattern.length));
                    done = true;
                    break;
                }
            case "*":
                this.op_stack.push(operator.KLEENE);
                break;
            case "|":
                this.op_stack.push(operator.UNION);
                break;
            default:
                if(this.stack.length > 0) {
                    //only push concat after we see at least one symbol
                    this.op_stack.push(operator.CONCAT);
                }
                this.stack.push(pattern.charAt(i));
        }
    }
};
Note the area marked "TROUBLE SPOT". Given the regular expression "(a|b)a", this.parent.parse is called exactly once: when the sub-automaton encounters ")". At that point, pattern.substring(i, pattern.length) is ")a". This "works", but it isn't correct, because I need to consume the ")" before passing the rest of the string to the parent automaton. However, if I change the call to this.parent.parse(pattern.substring(i+1, pattern.length)), parse is handed the empty string! I have stepped through the code and cannot explain this behavior. What am I missing?
At Juan's suggestion, I made a quick jsfiddle to show the problem when trying to parse "(a|b)a" with this algorithm. In the ")" case, it populates an empty div with the substring at the i index and the substring at the i+1 index. It shows that while there are 2 characters in the substring at i, the substring at i+1 is the empty string! Here's the link: http://jsfiddle.net/XC6QM/1/
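In case the fiddle is unavailable, here is a minimal sketch of what it demonstrates. This is not the fiddle's exact code: probeCloseParen is a hypothetical helper I wrote only for this illustration, and it returns the two substrings instead of writing them to a div. It scans the pattern with the same loop that parse uses and stops at the first ")". I also record typeof i, on the assumption that the type of the loop variable may be worth inspecting.

```javascript
// Hypothetical helper, not part of the NFA code above. It walks the pattern
// with the same for...in loop as parse() and, at the first ")", reports what
// the two substring calls return.
function probeCloseParen(pattern) {
    for (var i in pattern) {
        if (pattern.charAt(i) === ")") {
            return {
                // type of the loop variable produced by for...in
                indexType: typeof i,
                // the call that "works"
                atI: pattern.substring(i, pattern.length),
                // the call that comes back empty
                atIPlusOne: pattern.substring(i + 1, pattern.length)
            };
        }
    }
    return null; // no ")" in the pattern
}

// The sub-automaton receives everything after "(" in "(a|b)a".
var result = probeCloseParen("a|b)a");
console.log(result);
```

Running this outside the NFA class shows the same symptom, so the recursion and the stacks do not seem to be involved.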
EDIT: I edited this question to reflect the fact that using charAt(i) doesn't change the behavior of the algorithm.