Implementing "?" (lazy "") regexp pattern in combinatorial GLR parsers

Question

I have implemented combinatorial GLR parsers. Among them there are:

char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.

Example: "char('a').many()" will match a string with any number of "a"-s.

But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.

I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?

Edit:

May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.

Parser = f(Step) -> Collection of TreeNodes of Steps.

So when I parse input, I:

Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.

It still isn't really clear to me just how GLR your implementation is. A traditional GLR parser has a monolithic parser table that incorporates information from the whole grammar. Are you doing that? — Jason Orendorff, Dec 09 '10 at 15:44

Ira Baxter · Answer 1 · 2010-12-09T14:54:47.990

I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.

I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:

 char = "a" ;
 char = char "a" ;

GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.

If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.

Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.

About notation: "combinatorial" means I have sets of simple parsers ("char(...)" parser, which expects specified char, "end()", which expects end of text and some others) and their combinators (e. g., x.many() returns parser which repeats x as many times as possible). All simple parsers are GLR in fact, and their combinations are also GLR. — Lavir the Whiolet, Dec 09 '10 at 08:13
The problem is that I want to construct GLR parser which is equivalent to regular regexp "(something)*?(some other thing)". I don't know how to make GLR parser to stop repeating "something" when "some other thing" is encountered. Darius Bacon suggested to mark appropriate parses whether they are produced by eager or lazy combinator, but I don't know how reasonable error recovery can be added to such an algorithm. — Lavir the Whiolet, Dec 09 '10 at 08:24
Its pretty unclear what you mean when you say "all simple parsers are GLR". Your simple parsers appear to be just "is_member_character_set(char)" which is trivial to check. How is that GLR? Where are the psuedo-parallel parses with shared forests? — Ira Baxter, Dec 09 '10 at 14:49
I'm writing parser with error recovery. Yes, "char(...)" parser is trivial... when there are no errors in input. When it encounters an error it splits into 2-3 alternatives (with different ways of error recovering). — Lavir the Whiolet, Dec 09 '10 at 18:59

score 1 · Answer 2 · answered Dec 06 '10 at 11:39

1

Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)

(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)

answered Dec 06 '10 at 11:39

Darius Bacon

14,921
5
53
53

I also want to implement automatic error recovery. What if there are errors in input? Will such an algorithm recover from errors reasonably? – Lavir the Whiolet Dec 06 '10 at 12:41
Voting mainly for "You can't resolve the ambiguity incrementally before you've seen all the input, in general" part. :) – Lavir the Whiolet Dec 09 '10 at 19:51

score 0 · Answer 3 · answered Sep 19 '14 at 15:42

Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".

Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

score 0 · Accepted Answer · answered Dec 09 '10 at 15:42

0

Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?

Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?

This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.

One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.

answered Dec 09 '10 at 15:42

Jason Orendorff

42,793
6
62
96

The problem with this approach is that in case of errors in input such "*?" combinator becomes greedy. For example, parser with grammar " '<' (any char except 'X')*? '>' " applied to "<...>.X.<...>...<...>" will match "<...>.X.<...>" and will not report any error. And unfortunately it will be right. – Lavir the Whiolet Dec 09 '10 at 19:39
In a few words it is *on principle* impossible to implement error-recovery-friendly "*?" combinator which will not require explicit directions of what the "...*?" part is followed by (and example from my question should be rewritten as follows: " char('{') >> char('{') >> char('a'..'z').many().followed_by( char('}') >> char('}') ) "). Fix me if I am not right (that it's impossible *on principle*). – Lavir the Whiolet Dec 09 '10 at 19:49
In response to your first comment, I think that has only one parse, matching the first "<...>". The grammar seems to clearly rule out anything containing an X. So both the greedy parser and the non-greedy one should return ["<...>"] (an array of just 1 parse). – Jason Orendorff Dec 09 '10 at 20:44
@Lavir And to your second comment, yes, my answer is trying to say that it is impossible in principle to implement `*?` without bending the model at least a little. However there are probably many satisfactory ways to do so. All real-world parsers bend the base conceptual model somewhat. One of the strengths of combinatorial parsers (and recursive descent parsers), over table-driven ones, is that they are so flexible. – Jason Orendorff Dec 09 '10 at 20:49

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers

4 Answers4

Implementing "?" (lazy "") regexp pattern in combinatorial GLR parsers