I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
* This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst
took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state)
. It tries to match the current line to the regex
based on the current state and runs match_method
while transitioning to the next_state
if a match succeeds, doing this until it runs out of lines to scan.
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars
iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have and iterator of the lines of the source and match on a per-line basis, and if a line such as
* this is an indented list item
is encountered in State::Body
, simply transition to a state such as State::BulletList
and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
Any thoughts on this?