How to handle multiple rules for one token with PLY

Question

I'm converting a jison file into a lexer built with the lex module from Python's PLY.

I've noticed that in this jison file, certain tokens have multiple rules associated with them. For example, for the token CONTENT, the file specifies the following three rules:

[^\x00]*?/("{{")                 {
                                   if(yytext.slice(-2) === "\\\\") {
                                     strip(0,1);
                                     this.begin("mu");
                                   } else if(yytext.slice(-1) === "\\") {
                                     strip(0,1);
                                     this.begin("emu");
                                   } else {
                                     this.begin("mu");
                                   }
                                   if(yytext) return 'CONTENT';
                                 }

[^\x00]+                         return 'CONTENT';

// marks CONTENT up to the next mustache or escaped mustache
<emu>[^\x00]{2,}?/("{{"|"\\{{"|"\\\\{{"|<<EOF>>) {
                                   this.popState();
                                   return 'CONTENT';
                                 }

In another case, there are multiple rules for the COMMENT token:

<com>[\s\S]*?"--}}"              strip(0,4); this.popState(); return 'COMMENT';
<mu>"{{!--"                      this.popState(); this.begin('com');
<mu>"{{!"[\s\S]*?"}}"            strip(3,5); this.popState(); return 'COMMENT';

It seems easy enough to distinguish the rules when they apply to different states, but what about when they apply to the same state?

How can I translate this jison to python rules using ply.lex?

edit

In case it helps, this jison file is part of the handlebars.js source code. See: https://github.com/wycats/handlebars.js/blob/master/src/handlebars.l

score 0 · Answer 1 · answered Mar 23 '15 at 23:18

This question is difficult to answer; it is also two questions in one.

Jison (that's the language that the handlebars parser is written in, not bison) has some features not found in other lexers, and in particular not found in PLY. This makes it difficult to convert the lexical code you have shown from Jison to PLY. However, this is not the question you were focussed on. It is possible to answer your base question, how can multiple regular expressions return a single token in PLY, but this would not give you the solution to implementing the code you chose as your example!

First, let's address the question you asked. In PLY, one token can be returned for several regular expressions by combining them into a single pattern and attaching it to the rule function with the @TOKEN decorator, as shown in the PLY manual (section 4.11).

For example, we can do the following:

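from ply.lex import TOKEN

# jison's trailing context  /("{{")  becomes a lookahead in Python's re;
# +? rather than *? because PLY rejects rules that can match the empty string
content1 = r'[^\x00]+?(?=\{\{)'
content2 = r'[^\x00]+'
content = r'(' + content1 + r'|' + content2 + r')'

@TOKEN(content)
def t_CONTENT(t):
    ...  # rule body elided in the original answer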
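Before wiring a combined pattern like this into PLY, it can be checked with plain `re`. The following is only a minimal sketch: it assumes jison's trailing context `/("{{")` is translated to the lookahead `(?=\{\{)`, and the variable names are made up for illustration.

```python
import re

# First alternative: content up to, but not including, the next "{{".
# Second alternative: everything that is left when no "{{" follows.
content_before_mustache = r'[^\x00]+?(?=\{\{)'
content_to_end = r'[^\x00]+'
content = re.compile('(?:' + content_before_mustache + '|' + content_to_end + ')')

# The lookahead stops the match right before the mustache.
print(repr(content.match('Hello {{name}}!').group()))

# Without a mustache, the second alternative takes the whole input.
print(repr(content.match('no mustache here').group()))
```

Unlike the jison rule, this pattern cannot match an empty prefix, so input that starts with `{{` needs a separate rule that switches state before this pattern is tried.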

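However, this won't really work for the rules you have from jison, because they rely on start conditions (see the Jison Manual), a feature modelled on flex. Here, this.begin switches the lexer into a named state, and a rule prefixed with that name, such as <mu>, <emu> or <com>, is only active while the lexer is in that state. PLY has a counterpart in its conditional lexing support (a states tuple plus t.lexer.begin(), t.lexer.push_state() and t.lexer.pop_state()), but the patterns themselves still do not carry over: the / trailing-context operator and <<EOF>> are jison syntax that Python's re module does not understand.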
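To make the state switching concrete, here is a hand-rolled sketch in plain `re` (deliberately not PLY) of what the COMMENT rules from the question do. The rule table and the `tokenize` helper are hypothetical; only the state names and patterns come from the jison file. The escape handling (`emu`) is left out, and `popState` is simplified to "return to INITIAL".

```python
import re

# Rules per state: (compiled pattern, token type or None, next state).
# Rules are tried in order and the first match wins.
RULES = {
    'INITIAL': [
        (re.compile(r'([^\x00]*?)(?=\{\{)'), 'CONTENT', 'mu'),
        (re.compile(r'([^\x00]+)'), 'CONTENT', 'INITIAL'),
    ],
    'mu': [
        (re.compile(r'\{\{!--'), None, 'com'),
        (re.compile(r'\{\{!([\s\S]*?)\}\}'), 'COMMENT', 'INITIAL'),
    ],
    'com': [
        (re.compile(r'([\s\S]*?)--\}\}'), 'COMMENT', 'INITIAL'),
    ],
}


def tokenize(text):
    """Return (type, value) pairs, using only the rules of the current state."""
    state, pos, tokens = 'INITIAL', 0, []
    while pos < len(text):
        for pattern, tok_type, next_state in RULES[state]:
            match = pattern.match(text, pos)
            if match is None:
                continue
            value = match.group(1) if tok_type else ''
            # An empty CONTENT is dropped, like `if(yytext) return 'CONTENT'`.
            if tok_type and (value or tok_type != 'CONTENT'):
                tokens.append((tok_type, value))
            pos, state = match.end(), next_state
            break
        else:
            raise ValueError('no rule matches in state %r at offset %d' % (state, pos))
    return tokens


print(tokenize('a {{! short }} b {{!-- long }} one --}} c'))
```

The capture groups play the role of the `strip(...)` calls: they keep the comment body and drop the delimiters. Any other mustache, such as `{{name}}`, raises an error here because the remaining `<mu>` rules are not part of the sketch.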

To match these lexemes, it is really necessary to go back to the syntax of the handlebars/moustache language/notation and create new regular expressions. Somehow I feel that completely re-implementing the whole of handlebars for you in an SO answer is perhaps a step too far.

However, I have identified the steps to a solution for you, and anyone else who treads this path.
