
As noted in the Wikipedia article on Parsing, the process has three stages:

  1. Lexical analysis (tokenization): Convert Unicode code points to tokens
  2. Syntactic analysis: Verify that the token stream forms a valid Script / Module, and create a Parse Tree
  3. Semantic analysis: Additional verification of tokens (happens after Parse Tree is created?)

Other than the small confusion in stage (3) above, I wanted to verify that my understanding of the process is correct for ECMAScript.

Thus, is the below flow correct?


Lexical Analysis Phase (ECMAScript Clause 11)

  • Input: Stream of Unicode code points <-- Terminal symbols in the lexical grammar
  • Output: Valid tokens <-- Nonterminal symbols in the lexical grammar
  • Application of grammar
    1. Each Unicode code point (character) is analysed, one at a time
    2. The longest possible sequence of terminal symbols is replaced with a nonterminal symbol, by applying a suitable production rule (see the sketch after this list)
    3. Then, the longest possible sequence of nonterminal symbols is replaced, again by applying production rules
    4. In the same way, production rules are applied again and again, all the way until the "goal symbol(s)" are produced
  • Goal symbols are input elements (aka tokens) for the syntactic analysis phase (next phase)
  • Multiple "goal symbols" exist for ECMAScript's lexical grammar (spec states which to pick)
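
Below is a rough sketch (mine, not the spec's algorithm) of what "maximal munch" lexing looks like over a handful of illustrative patterns. The pattern names loosely echo ECMA-262 nonterminals, but real ECMAScript lexing also depends on the current lexical goal symbol, which this sketch ignores:

    // Toy maximal-munch tokenizer; illustrative only, not the spec's algorithm.
    const patterns = [
      ["WhiteSpace",     /^[ \t]+/],
      ["IdentifierName", /^[A-Za-z_$][A-Za-z0-9_$]*/],
      ["NumericLiteral", /^[0-9]+/],
      ["Punctuator",     /^(?:===|==|=|;)/],
    ];

    function tokenize(source) {
      const tokens = [];
      while (source.length > 0) {
        let best = null;
        for (const [type, re] of patterns) {
          const m = re.exec(source);
          if (m && (best === null || m[0].length > best.raw.length)) {
            best = { type, raw: m[0] };            // keep the longest match seen so far
          }
        }
        if (best === null) throw new SyntaxError("Unexpected character: " + source[0]);
        if (best.type !== "WhiteSpace") tokens.push(best);   // white space is consumed but produces no token
        source = source.slice(best.raw.length);
      }
      return tokens;
    }

    console.log(tokenize("x === 10;").map(t => t.type + " " + t.raw));
    // [ 'IdentifierName x', 'Punctuator ===', 'NumericLiteral 10', 'Punctuator ;' ]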

Syntactic Analysis Phase (ECMAScript Clause 12-15)

  • Input: Stream of tokens <-- Terminal symbols in syntactic grammar
  • Output: Parse Tree, with Script|Module as root Parse Node <-- Nonterminal symbol in syntactic grammar
  • Application of grammar
    1. Start with the stream of input elements, aka tokens
    2. These tokens are terminal symbols in the syntactic grammar
    3. Apply production rules by matching the longest possible run of symbols against the RHS of a suitable production rule, then replacing that run with the LHS nonterminal symbol of that rule (a rough reduction trace is sketched after this list)
    4. This continues until only the "goal symbol" is left
  • ECMAScript: A program is valid if we can replace all terminal symbols (tokens) so that we end up with the single "goal symbol" (Script | Module)
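
To make step 3 concrete, here is roughly (and only roughly: many intermediate nonterminals of the real ECMA-262 grammar are elided) how the token stream for the one-statement script a + b; would reduce to the goal symbol:

    a + b ;                                            tokens (terminal symbols)
    IdentifierReference + IdentifierReference ;
    AdditiveExpression + MultiplicativeExpression ;    (unit productions elided)
    AdditiveExpression ;
    Expression ;
    ExpressionStatement
    StatementListItem                                  (via Statement)
    StatementList
    ScriptBody
    Script                                             the goal symbol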
Magnus

1 Answer


The syntactic parsing does not obey the "maximal munch" rule (select the longest matching prefix). In fact, as far as I know ECMA-262 does not specify a parsing algorithm, but does provide an unambiguous context-free grammar which can be parsed, for example with a bottom-up (LR(k)) parser, aside from some issues dealing with automatic semicolon insertion and some restrictions on productions which span a newline (which is not a syntactic token).

However, as mentioned in §5.1.4, the grammar actually recognises a superset of the language; additional restrictions are provided in the form of supplementary grammars.

One clarification: The complexities related to having multiple context-dependent lexical goal symbols make it difficult to first divide the input into lexemes and only then combine the lexemes into a parse tree. It is impossible to know the correct lexical goal symbol at each point without at least a partial parse, so it is convenient to interleave the syntactic and lexical parses. Practical parsing algorithms operate from left to right, processing lexemes basically in input order, so it is possible to do lexical analysis on demand, only finding a lexeme when the parser needs more input to continue.
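
A concrete example of that context dependence is the classic "/" ambiguity; whether the parser asks the lexer for the InputElementDiv goal or the InputElementRegExp goal depends on what it has parsed so far (the values here are just for illustration):

    let a = 6, g = 3, i = 2;
    let x = a /g/i;            // a complete operand precedes the "/": goal is InputElementDiv,
                               // so "/" is the division punctuator and x === 1
    let y = "gag".match(/g/i); // an operand is expected here: goal is InputElementRegExp,
                               // so "/g/i" is a single RegularExpressionLiteral token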

But aside from that, the overall structure you outline is correct. In the lexical parse, the longest possible prefix of terminals (characters) is aggregated into a non-terminal to create a lexeme (according to slightly complicated rules about which lexical goal is required); in the syntactic parse, terminals (lexemes) are aggregated into non-terminals to produce a single parse tree corresponding to one of two syntactic goal symbols.

As is often the case with real-world languages, the reality is not quite as clean as that. Aside from the need for the parser to indicate which lexical goal is required, there are also the newline rules and automatic semicolon insertion, both of which cross the boundary between lexical and syntactic parsing.
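
A standard illustration of the latter (not tied to any particular engine) is the restricted production after return: the line terminator never becomes a token, yet it changes the syntactic result because automatic semicolon insertion kicks in:

    function f() {
      return        // a LineTerminator may not follow "return" here, so ASI
        42;         // inserts a semicolon: the 42 is never evaluated
    }
    console.log(f());   // prints "undefined", not 42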


Note:

The use of the words "terminal" and "non-terminal" can be a bit confusing, but I (and the ECMA standard) use them with the standard meaning in a context-free grammar.

A context-free grammar consists of productions, each of which has the form:

N ⇒ S …

where N is a non-terminal symbol and S is a possibly-empty sequence of either terminal or non-terminal symbols. Terminal symbols are atoms in the representation of the string to be recognized.
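
For example, the ECMA-262 syntactic grammar contains productions along these lines (grammar parameters such as [Yield, Await] elided here):

    AdditiveExpression :
        MultiplicativeExpression
        AdditiveExpression + MultiplicativeExpression
        AdditiveExpression - MultiplicativeExpression

Here AdditiveExpression and MultiplicativeExpression are non-terminal symbols, while + and - are terminal symbols (punctuator tokens at the syntactic level).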

The standard parsing model divides the parse into two levels: lexical and syntactic. The original input is a sequence of characters; lexical analysis turns this into a sequence of lexemes, which are the input to the syntactic parse.

A standard context-free grammar has a single goal symbol, which is one of the non-terminals defined by the grammar. The parse succeeds if the entire input can be reduced to this non-terminal.

A lexical scan can be viewed as a context-free grammar with an ordered list of goal symbols. It tries each goal symbol in turn on successively longer prefixes of the input, and accepts the first goal symbol which matched the longest prefix. (In practice, this is all done in parallel; I'm talking conceptually here.) When ECMA-262 talks about different lexical goals, it really means different lists of possible goal non-terminals.
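
Concretely, the lexical goal symbols defined by ECMA-262 (as of the editions current at the time of writing, to the best of my recollection) are:

    InputElementDiv
    InputElementRegExp
    InputElementRegExpOrTemplateTail
    InputElementTemplateTail

Each of these is itself a set of alternatives (white space, line terminators, comments, tokens, and so on), and the state of the syntactic parse determines which one applies at a given point.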

It's also useful to augment symbols with semantic attributes; these attributes do not influence the parse, but they are useful once the parse is done. In particular, the parse tree is built by attaching a tree node as an attribute to each non-terminal created from a production during the parse, so that the final result of the parse is not the non-terminal symbol as such (that's known before the parse starts) but rather the semantic attributes attached to that particular instance of the non-terminal. Similarly, the result of the lexical scan at each point is a non-terminal symbol and its associated semantic attributes; typically, the semantic attribute will be the associated input sequence, or some function of those characters.
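
A rough sketch of what such attribute-carrying symbols might look like (the object shapes are mine, loosely ESTree-flavoured, and not anything mandated by the standard):

    // A lexeme: a lexical non-terminal plus its semantic attributes.
    const token = { type: "NumericLiteral", raw: "42", value: 42 };

    // A parse node: a syntactic non-terminal plus its attributes, here the
    // little tree built for the ExpressionStatement `42;`.
    const node = {
      type: "ExpressionStatement",
      expression: { type: "NumericLiteral", value: 42 },
    };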

In any event, the two-level parse involves feeding the output non-terminals of the lexical level as terminals for the syntactic level.

rici
  • @Magnus A non-terminal symbol might be used as an annotation in the parse tree, but it is not the output itself (even in the lexical phase, where the output parse tree degenerates to basically a list). Apart from that, your understanding [seems](https://stackoverflow.com/a/49695955/1048572) [fine](https://stackoverflow.com/q/49596932/1048572). And rici already explained that production rules are not "applied" in a real-world parser. – Bergi Aug 15 '18 at 18:03
  • @Bergi: Yeah, I was abusing terminology a bit. I said "non-terminal", not "non-terminal symbol" leaving implicit the understanding that the non-terminal (whether lexical or syntactic) also contains semantic attributes. – rici Aug 15 '18 at 18:11