Writing a lexer for a context sensitive markup language, that has recursive structures such as nested lists

Question

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:

* This is a list item

  * This is a sub list item

* And here we are at the preceding indentation level again.

The default docutils.parsers.rst took the approach of scanning the input one line at a time:

The reStructuredText parser is implemented as a state machine, examining its input one line at a time.

The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.

My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.

Would it be better to simply have and iterator of the lines of the source and match on a per-line basis, and if a line such as

    * this is an indented list item

is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence

TokenType::Indent, TokenType::Bullet, TokenType::BodyText

Any thoughts on this?

You may want to ask your question on r/compilers. FYI you seem to be confusing _lexing_ and _parsing_; and in general Regex are not the best approach to either lexing nor parsing -- though they can be used for quick & dirty experimentation. — Matthieu M., Jun 05 '20 at 10:04
@MatthieuM. I do have an idea of what lexing and parsing are: a lexer generates tokens while a parser builds a syntax tree out of those. The docutils "parser" seems to be a mix of both, and my question pertained to the idea of building a completely separate lexer and parser for rST. — sesodesa, Jun 05 '20 at 10:45
A separate lexer should not _care_ about whether `*` is there for (1) a bullet list, (2) the start of italics, or (3) the start of bold. In that sense, a lexer is context-less. And therefore, your lexer should not require a state-machine. — Matthieu M., Jun 05 '20 at 10:56
The one wrinkle is if you want to represent indentation as part of the token stream. I do personally think it is cleaner than annotating each and every token with line/column information. If you merely representation the indentation level (x spaces/tabs) then this is still context insensitive, if you represent Indent / Deindent, then it is _slightly_ context sensitive, but still does not require a state machine -- only the indentation level of the previous (significant) line. — Matthieu M., Jun 05 '20 at 10:57
@MatthieuM. I guess my confusion stems from the fact that I would still use compiled regular expressions (a.k.a. deterministic finite automata) to detect these simpler non-contextual tokens, so when somebody tells me I'm not using a state machine with those I get cognitive dissonance. In any case, I think this is the direction I wish to go in, generating tokens line-by-line. The parser can then worry about indentation (whitespace at hte beginning of a line) levels, as it will know how to count as a pushdown automaton. — sesodesa, Jun 05 '20 at 11:25
@JohnKugelman If I happened to know the length of indent, as in the number of spaces, couldn't I check the level of indentation in the parser, which *will* have states anyways? — sesodesa, Jun 05 '20 at 13:19
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215370/discussion-between-matthieu-m-and-sesodesa). — Matthieu M., Jun 05 '20 at 13:23

Ira Baxter · Accepted Answer · 2020-06-05T17:07:49.340

I don't know much about rST. But you say it has "recursive" structures. If that's that case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.

But this the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found). So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163

If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what are you building is a sloppy parser disguised as lexer. (You will probably still want a real parser to process the output of this "lexer").

Having a pushdown stack actually useful if the language has different atoms in different contexts especially if the contexts nest; in this case what you want is mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.

As an example you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of { ... } [...] and (... ). (This has arguably a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. Javascript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)

A special case of the stack occurs if you only have track single "(" ... ")" pairs or the equivalent. In this case you can use a counter to record how many "pushes" or "pop" you might have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.

Yeah, I think I now have a clearer picture of the separation of concerns. What I intend to do is to simply scan one line at a time and check for tokens like bullets, normal text, indents and so forth. I'll then pass that token stream to the parser (to be implemented). The thing is, I want to get rid of some extra whitespace already in the lexing stage, and making this more modular will make it easier for me to grasp as well. My first transpiler, etc. — sesodesa, Jun 05 '20 at 13:25

Writing a lexer for a context sensitive markup language, that has recursive structures such as nested lists

1 Answers1