
Imagine a simple grammar:

(a|ab)c

Which reads (a or ab) followed by c. The parse tree would look like this:

   and
   / \
  or  c
 /  \
a   ab

Now give it this input:

abc

We would first traverse down the left side of the tree and match "a", then go back up a level. Since "a" matched, the "or" is also true, so we move on to the "c". "c" does not match (the next unconsumed character is "b"), and we hit the end of the road.

But there was an alternate path it could have taken; had we traversed down to "ab", we would have found a match.

So what I want to do for "or" nodes is essentially this:

  1. Evaluate left branch
  2. If left branch doesn't match, try right branch
  3. If the left branch does match, push the current state onto a stack so that we can continue from this point later if necessary

Then whenever the parser hits a dead end, I want to pop an item off the stack and continue from there again.

That's the part I can't figure out...how do I essentially save the current call stack? I can save the "ab" node in a stack so that I know I have to execute that one next, but then it still needs to know it needs to fall back up to the "or" afterwards.
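
Roughly, here is the bookkeeping I have in mind (hypothetical types and names, not working code from my parser):

    using System.Collections.Generic;

    abstract class Node { }  // whatever my grammar nodes end up looking like

    // A choice point remembers the branch the "or" hasn't tried yet and where
    // the input was at that moment.
    class ChoicePoint
    {
        public Node UntriedBranch;    // e.g. the "ab" node
        public int  InputPosition;    // input offset when the "or" picked "a"
    }

    class BacktrackingParser
    {
        readonly Stack<ChoicePoint> pending = new Stack<ChoicePoint>();
        // On a dead end I can pop a ChoicePoint, rewind the input, and match
        // UntriedBranch -- but then execution somehow has to resume at the "c"
        // above the "or", and that's exactly the part I'm missing.
    }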


I think Chris was on to something. We have to find a way to translate the tree such that it isn't necessary to jump across branches like that. For example, this equivalent parse tree doesn't have that problem:

     or
    /  \
 and   and
 / \    / \
a   c  ab  c

This time we parse down the left, hit "a", it passes, so we try the "c" node beside it, that fails, "and" fails, "or" has to try the right branch, ... "ab" passes, the other "c" passes, and then the whole expression passes.

mpen
  • As a side note, the terminology here is kind of funny. "Stacks" are usually thought of as stacking upwards, so when you pop an item off the stack, you fall down a level, but since I drew my tree the way I did, you're falling upwards. Perhaps I should flip the tree? `(ノಠ益ಠ)ノ彡┻━┻` – mpen Mar 16 '12 at 18:20

4 Answers


You have the answer to your question in the way you posed it.

You need to save the state. The tricky part is identifying the state. Saving it is easy.

Your problem is that the parser "has a state" when it starts parsing some grammar rule. (This gets messier if you use an LALR parser, which merges the parsing of many rules into a single state). That state consists of:

  • the state of the input (e.g., where is the input stream?).
  • the state of the parse stack (what is the left context seen so far?)
  • where the parser should continue for success, and where to continue for failure
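
In code, that saved state needn't be more than a small record; the field names here are only illustrative:

    // Illustrative only -- the exact fields depend on your scanner and parser.
    struct SavedParserState
    {
        public long InputPosition;    // where the input stream was
        public int  ParseStackDepth;  // how much left context had been stacked
        // The success/failure continuations need no fields here: in a
        // recursive-descent parser they live on the call stack as a return
        // address plus a boolean result (see below).
    }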

When you are parsing and face a choice alternative as you have described, you need to "save the state" and run a trial parse on the first term. If it succeeds, you can throw away the saved state and continue. If it fails, restore the state and try the 2nd (and nth) alternatives. (It's easier to be brainless and just save the state regardless of whether you face an alternative, but that's up to you.)
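
Here is a minimal sketch of that save / trial-parse / restore pattern for the grammar in the question, assuming the whole input sits in a string so the state is just an index (the names are illustrative, not anyone's production code):

    // Sketch only: the entire parser state here is an index into the input string.
    class TrialParser
    {
        readonly string input;
        int pos;

        public TrialParser(string input) { this.input = input; }

        bool Lit(string s)                          // consume a literal, or leave pos alone
        {
            if (pos + s.Length > input.Length || input.Substring(pos, s.Length) != s)
                return false;
            pos += s.Length;
            return true;
        }

        public bool Parse()                         // (a | ab) c
        {
            int saved = pos;                        // save the state
            if (Lit("a") && Lit("c")) return true;  // trial parse: first alternative
            pos = saved;                            // failure: restore the state
            return Lit("ab") && Lit("c");           // then try the second alternative
        }
    }

new TrialParser("abc").Parse() returns true: the "a ... c" attempt fails partway through, the position is rewound, and "ab ... c" is tried from the same spot.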

How can you save the state? Push it onto a stack. (You typically have a parse stack; that's a pretty convenient place! If you don't like that, add another stack, but you'll discover that it and the parse stack generally move synchronously; I just make the parse stack contain a record with all the stuff I need, including space for the input. And you'll find the "call stack" convenient for parts of the state; see below.)

The first thing is to save the input location; that is likely a file source position and, for optimization reasons, likely a buffer index. That's just a scalar, so it is pretty easy to save. Restoring the input stream may be harder; there's no guarantee that the parser input scanner is anywhere near where it was. So you need to reposition the file, re-read any buffer, and reposition any input buffer pointer. Some simple checks can make this statistically cheap: store the file position of the first character of any buffer; then determining whether you have to re-read the buffer is a matter of comparing the saved file position with the buffer start file position. The rest should be obvious.
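
A sketch of that check, with assumed field names (your scanner will be organized differently):

    using System.IO;

    class BufferedScanner
    {
        readonly FileStream file;
        readonly byte[] buffer = new byte[4096];
        long bufferStartFilePos;   // file offset of buffer[0], recorded on every refill
        int  bufferLength;
        int  bufferIndex;

        public BufferedScanner(FileStream file) { this.file = file; }

        // Reposition to a saved file position, re-reading only when it falls
        // outside the buffer already in memory.
        void RestoreInput(long savedFilePos)
        {
            if (savedFilePos >= bufferStartFilePos &&
                savedFilePos <  bufferStartFilePos + bufferLength)
            {
                bufferIndex = (int)(savedFilePos - bufferStartFilePos);  // cheap case
            }
            else
            {
                file.Seek(savedFilePos, SeekOrigin.Begin);               // expensive case
                bufferStartFilePos = savedFilePos;
                bufferLength = file.Read(buffer, 0, buffer.Length);
                bufferIndex = 0;
            }
        }
    }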

You'll backtrack through fewer buffers (i.e., your parser runs faster) if you rearrange your grammar to minimize that. In your specific grammar, you have "(a | ab) c", which could be rewritten by hand to "a b? c". The latter will at least not backtrack across whatever a represents.
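
Added to the TrialParser sketch above, the rewritten form never needs to rewind:

    public bool ParseRewritten()        // a b? c -- no backtracking required
    {
        if (!Lit("a")) return false;
        Lit("b");                       // optional term: take it if present, never undo it
        return Lit("c");
    }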

The odd part is saving the parse stack. Well, you don't have to, because your trial parse will only extend the parse stack you already have, and it ends up back at the state you have now whether the subparse succeeds or fails.

"where the parser goes on fail" and "where it goes on success" are just two more scalars. You can represent them as indexes of your parsing code blocks, and program counters (e.g., continuations) or as a return address on your call stack (see? another parallel stack!) followed by a conditional test to success/failure.

If you want some details on the latter, check out my SO answer on hand-coded recursive descent parsers.

If you start building trees, or doing something else as a side effect of the parse, you'll have to figure out how to capture/save the state of the side-effected entity and restore it. But whatever it is, you'll end up pushing it on a stack.
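
For example, if the side effect is appending nodes to a list, the saved "state" can be as small as the list's length at the choice point; a sketch:

    using System.Collections.Generic;

    static class SideEffectUndo
    {
        // Trim the node list back to the length it had when the state was saved,
        // discarding whatever the failed trial parse appended.
        public static void Rollback<T>(List<T> builtNodes, int savedCount)
        {
            builtNodes.RemoveRange(savedCount, builtNodes.Count - savedCount);
        }
    }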

Ira Baxter
  • "You can represent them as indexes of your parsing code blocks, and program counters (e.g., continuations) or as a return address on your call stack" -- I'm writing this specifically in C#; I don't think I have access to any program counter. But regardless if I could jump back to any line of code, I don't know how I can possibly restore the call stack. In my example, once "a" succeeds I can push the "ab node onto a stack so that I know I have to execute that node next if there's a failure down the road, and I can save the position of where I was at too...the part I'm specifically confused... – mpen Mar 17 '12 at 03:04
  • ...about is after I execute the "ab" node, where does the code flow to next? It would return to wherever I called it from, which would be above the root node, because that's where I'd pop it off the stack. Perhaps I should have put some more emphasis on the C# part, or perhaps my current approach is unsalvageable. – mpen Mar 17 '12 at 03:05
  • OK, you're in C# and don't have access to the PC. So use the third idea, e.g., a call with a boolean return (think of this as "the subparse succeeded/failed") and a conditional branch. Look at the hand-coded recursive descent answer, which does exactly this. – Ira Baxter Mar 17 '12 at 07:36
  • I think I figured out where I'm getting confused. You're suggesting I create a parser that's specific to my grammar; I was trying to create a parser that would essentially handle any grammar. What I really need to do is build a parser out of my grammar, and then use that parser to parse the input, not go straight from grammar to parsing (which I now suspect isn't possible for any sufficiently complex grammar). – mpen Mar 17 '12 at 07:59
  • @Mark: Yes, I am suggesting to build a parser that matches the grammar, because it's pretty easy to do by hand, and it can be automated pretty well. You can build an interpreter of a grammar that parses; it has the same problems with storing the state, and the solution is still the same. You can also build a parser that doesn't backtrack, so it doesn't have to do this state saving; see GLR parsers (these essentially compile grammars for a parsing engine to process) or Earley parsers (which will parse any context-free grammar without an interpreter.) http://en.wikipedia.org/wiki/Earley_parser – Ira Baxter Mar 17 '12 at 20:31
  • I think what I had was a context-free grammar. Worked for a little while until I realized my intended grammar isn't context-free :) – mpen Mar 17 '12 at 22:52
  • Don't confuse "context-free language" (most of them aren't) with "context-free grammar" (most of them are). There's always a context-free grammar for a language, and what parser generators do is help you recognize the context-free part. It's up to you and the rest of the front end/compiler to detect violations of the non-context-free part. (That's why most compilers have symbol tables). – Ira Baxter Mar 18 '12 at 19:51

What you should do is call a method for each possibility. If you hit a dead end, you can return, and you will be right back where you were, ready to try the next option.

You can indicate whether you successfully parsed the branch by returning a value from the parsing method. For example, you could return true for success and false for failure. In this case, if the parse returns false, you try the next option.

Kendall Frey
  • This is the problem Kendall. In my example, I go down the left side of the tree, and I hit "a". "a" *succeeds* so I *don't* try the next option, and thus I've lost my place by the time I hit "c". I don't know how to get back to where I was. – mpen Mar 16 '12 at 17:07
  • You return from the method that parsed the left side, returning `true`, or maybe the parse tree for that branch. – Kendall Frey Mar 16 '12 at 17:13
  • Huh? When I parse "a" it returns true, returning to "or". "or" sees that "a" is true and thus also returns true, returning back to "and". "and" then evaluates the other branch, "c". "c" fails (returns false). Returning from "c" goes back to "and" (which then also fails). Can you elaborate on how I can return from "c" to "or"? – mpen Mar 16 '12 at 18:15
  • You should rewrite your grammar to eliminate that problem. You can use `ab?c` where ? indicates an optional component. – Kendall Frey Mar 16 '12 at 18:19
  • No. This was a primitive example. Unless you can prove how all "or" statements can be replaced with something followed by an optional term... this isn't possible. – mpen Mar 16 '12 at 18:21
  • Not all 'or' statements need to be. For example, `(a|b)c` is fine the way it is. – Kendall Frey Mar 16 '12 at 18:25
  • Hrm... even if that could work, I'd need an algorithm for finding the longest prefix between 2 "or" options, to keep my grammars simple. Which would be doable if all the elements were strings, but that isn't necessarily so. They could be regexes, or other complex nodes for which you can't easily break up like that. – mpen Mar 16 '12 at 18:56
  • You don't have to do this for all cases; it's an optimization (a pretty good one). – Ira Baxter Mar 17 '12 at 07:37

Just return your state in addition to your result. Let's take a simple example where you can have an index for each element:

Grammar: (a|ab)c
Translated: AND(OR(a,ab),c)
Input: abc

Call AND
Call OR
a matches, return true,1
c does not match, start over
Call OR giving 1
ab matches, return true,2
c matches, return true

You will need a more complex structure to handle harder cases (whether it's a queue or a stack depends on how you build and destruct things when recreating your state).
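
As one illustration of such a structure (mine, not part of this answer): in C# you can have every node return all of the input positions it can end at, and let iterator blocks hold the in-between state. The AND node then falls back to the OR node's next possibility automatically when its right side fails, instead of restarting from the beginning:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    abstract class Node
    {
        // Yields every input position this node can finish at, starting from pos.
        public abstract IEnumerable<int> Match(string input, int pos);
    }

    class Literal : Node
    {
        readonly string text;
        public Literal(string text) { this.text = text; }

        public override IEnumerable<int> Match(string input, int pos)
        {
            if (pos + text.Length <= input.Length &&
                input.Substring(pos, text.Length) == text)
                yield return pos + text.Length;
        }
    }

    class Or : Node
    {
        readonly Node left, right;
        public Or(Node left, Node right) { this.left = left; this.right = right; }

        public override IEnumerable<int> Match(string input, int pos)
        {
            foreach (int end in left.Match(input, pos))  yield return end;
            foreach (int end in right.Match(input, pos)) yield return end;
        }
    }

    class And : Node
    {
        readonly Node left, right;
        public And(Node left, Node right) { this.left = left; this.right = right; }

        public override IEnumerable<int> Match(string input, int pos)
        {
            // For every way the left child can match, try the right child after it.
            // When the right child fails, the outer loop simply moves on to the
            // left child's next possibility -- that is the backtracking.
            foreach (int mid in left.Match(input, pos))
                foreach (int end in right.Match(input, mid))
                    yield return end;
        }
    }

    class Demo
    {
        static void Main()
        {
            // (a|ab)c against "abc": the a..c attempt yields nothing, ab..c succeeds.
            Node tree = new And(new Or(new Literal("a"), new Literal("ab")),
                                new Literal("c"));
            Console.WriteLine(tree.Match("abc", 0).Any(end => end == 3));  // True
        }
    }

The "saved call stack" the question asks about is exactly what the compiler-generated iterator state machines are holding here.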

Guvante
  • You're saying I should restart from the very beginning each time I hit a dead end, but take a different path through? Sounds plausible, but clunky and inefficient. What if I have a really deeply nested tree? The first half or 90% of the path might be exactly the same. Plus...it would be very tricky to encode which path to take. I might travel through the same node 5 times, but only on the 5th pass do I want to fork in a different direction... – mpen Mar 16 '12 at 22:25
  • Your second pass will be pretty cheap, if the exact same child doesn't fail, then your own answer doesn't change either. If it fails (leafs would fail on the second pass every time for instance) then try your next child (if that is what your logic allows). My logic is really trying to duplicate your "save stack" idea, just a little more concrete on what you would do (saving the actual stack is both complicated and incredibly expensive, who knows how deep it goes). Obviously there are other algorithms that could be used, you could look into how regex parsers work for instance. – Guvante Mar 16 '12 at 22:30
  • In that case, I'd have to cache the results. *Most* of the operations are pretty cheap though (checking if about 5 characters match), so... I don't think that would yield much benefit. I think the biggest cost would be the iteration, especially if there's a lot of backtracking. Saving the stack, expensive? In terms of memory? Rather save CPU time than memory. But anyway, I'm mostly interested in a clean and easy to implement solution right now. – mpen Mar 17 '12 at 02:43

You need to use recursion.

Something like:

    in the "or" node, for each alternative:
        bool ret = eval(alternative)
        if (ret)
            bool recVal = recurse on the rest
            if (recVal)
                we found a path; stop the recursion
            else
                continue the loop and try the next alternative

Horonchik
  • You don't need recursion. You need a stack. You can get one implicitly by recursing. – Ira Baxter Mar 16 '12 at 16:39
  • I am using recursion, but it's only useful for moving back *up* the tree. What I need to do is go from "c" back to "ab". – mpen Mar 16 '12 at 17:08
  • You are correct, but after you fail in "c" you go back up the tree to whoever called it, which is "a", and you try the next one in the "or" statement, "ab". – Horonchik Mar 16 '12 at 18:47
  • @Horonchik: "a" doesn't call "c". "and" calls "c". That's how trees work. – mpen Mar 16 '12 at 18:58