I'm writing a lexer for Markdown. In the process, I realized that I do not fully understand what its core responsibility should be.

The most common definition of a lexer is that it translates an input stream of characters into an output stream of tokens.

Input         →  Output 
(characters)     (tokens)

That sounds quite simple at first, but the question that arises is how much semantic interpretation the lexer should do before handing its token output over to the parser.


Take this example of Markdown syntax:

### Headline
*This* is an emphasized word.

It might be translated by a lexer into the following series of tokens:

Lexer 1 Output

.headline("Headline")
.emphasis("This")
.text"(" is an emphasized word.")

But it could just as well be translated at a more granular level, depending on the grammar (or the set of lexemes) used:

Lexer 2 Output

.controlSymbol("#")
.controlSymbol("#")
.controlSymbol("#")
.text(" Headline")
.controlSymbol("*")
.text("This")
.controlSymbol("*")
.text"(" is an emphasized word.")

It seems a lot more practical to have the lexer produce output similar to that of Lexer 1, because the parser then has an easier job. But it also means that the lexer needs to understand semantically what the code means. It is not merely mapping a sequence of characters to a token; it needs to look ahead and identify patterns. (For example, it needs to be able to distinguish between **Hey* you* and **Hey** you. It cannot simply translate a double asterisk ** into .openingEmphasis, because that depends on the following context.)
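
To make that lookahead concrete, here is a deliberately naive sketch (a hypothetical helper, nowhere near the real CommonMark delimiter rules): before emitting anything for a leading double asterisk, the lexer would have to peek at the remainder of the line.

import Foundation

// Toy check only: given everything after a leading "**", decide whether that
// "**" can be treated as a strong-emphasis opener, i.e. whether a closing
// "**" appears later. Real CommonMark matching involves flanking rules,
// nesting and precedence that this ignores.
func doubleAsteriskCanOpenStrongEmphasis(rest: String) -> Bool {
    return rest.range(of: "**") != nil
}

doubleAsteriskCanOpenStrongEmphasis(rest: "Hey** you")  // true:  "**Hey** you"
doubleAsteriskCanOpenStrongEmphasis(rest: "Hey* you*")  // false: "**Hey* you*"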

According to this Stack Overflow post and the CommonMark definition, it seems to make sense to first break the Markdown input down into a number of blocks (each representing one or more lines) and then analyze the contents of each block in a second step. With the example above, this would mean the following:

.headlineBlock("Headline")
.paragraphBlock("*This* is an emphasized word.")

But this wouldn't count as a valid sequence of tokens, because some of the lexemes ("*") have not been tokenized yet, so it wouldn't be right to pass this paragraphBlock on to the parser.
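
As an illustration, a minimal sketch of that two-pass idea (hypothetical names, ignoring nearly all of CommonMark, e.g. consecutive lines are not merged into one paragraph) might look like this:

import Foundation

// Pass 1: split the input into blocks. Pass 2 (not shown) would then run an
// inline lexer/parser over each block's content, which is where the "*"
// lexemes finally become tokens.
enum Block {
    case headline(level: Int, content: String)
    case paragraph(content: String)
}

func splitIntoBlocks(_ input: String) -> [Block] {
    var blocks: [Block] = []
    for line in input.split(separator: "\n") {
        if line.hasPrefix("#") {
            let level = line.prefix(while: { $0 == "#" }).count
            let content = line.dropFirst(level).trimmingCharacters(in: .whitespaces)
            blocks.append(.headline(level: level, content: content))
        } else {
            blocks.append(.paragraph(content: String(line)))
        }
    }
    return blocks
}

splitIntoBlocks("### Headline\n*This* is an emphasized word.")
// [.headline(level: 3, content: "Headline"),
//  .paragraph(content: "*This* is an emphasized word.")]

The open question then just moves one level down: what the token stream for each block's content should look like.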


So here's my question:

Where do you draw the line?

How much semantic work should the lexer do? Is there a hard boundary in the definition of a lexer that I am not aware of?

What would be the best way to define a grammar for the lexer?

Mischa
  • 'How much semantic interpretation the lexer should do': None. Zero. Nada. Nulla. Rien. – user207421 Feb 28 '19 at 10:31
  • So you're saying even if the lexer read an opening parenthesis `(` in the source code of a programming language, it should always return `.openingParenthesis`, regardless of whether it's part of a string/comment or not? Because that would already be a semantic interpretation, to distinguish between parentheses that have a meaning in code and parentheses that don't. It's already saying: "This parenthesis has another meaning than that parenthesis." Even if the lexer doesn't know *what* the meaning is, it *does* know that there *are* two different meanings and needs to distinguish between them. – Mischa Feb 28 '19 at 11:04
  • Yes, that's exactly what I'm saying. It is up to the parser to parse, not the lexer. Scanners deal with regular languages, parsers with CFLs. NB This is syntax, not semantics. NB 2 It is up to the scanner to *remove* comments, not return them to the parser in any way. It is possible that your language as defined so far is ambiguous. – user207421 Feb 28 '19 at 11:45
  • @user207421 Comments and string literals are lexed as a single token, so since tokens can't be inside other tokens, a `(` inside a comment or string literal would *not* produce a `(` token. No lexer I've ever seen leaves the handling of comments and string literals to the parser. – sepp2k Feb 28 '19 at 17:10
  • @Mischa If you went the first way, then what should the result be for `### Head *line*`? `.headline(.text("Head"), .emphasis("line"))`? Then your lexer would be generating a tree and thus actually be a parser (of course, you might very well decide that something as simple as Markdown doesn't need a lexer and just write a lexerless parser instead). Generally, anything that can be nested can't be a single token. So the result of a lexer should be more like your second version (though I'd personally treat `###` as a single token). – sepp2k Feb 28 '19 at 17:32

1 Answer


BNF is used to describe many languages and to build lexers and parsers.

Most use a look-ahead of 1 to keep the format unambiguous.

Recently I was playing with an SQL BNF: https://github.com/ronsavage/SQL/blob/master/sql-92.bnf

I made the decision that my lexer would return only terminal token strings, similar to your option 2:

  • '('
  • ')'
  • 'KEYWORDS'
  • '-- comment eol'
  • '12.34'
  • ...

Any rule that defines the syntax tree is left to the parser, for example:

  • <Document> := <Lines>
  • <Lines> := <Line> [<Lines>]
  • <Line> := ...
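
To mirror that split in code (a rough Swift sketch with hypothetical names, not an implementation of the grammar above): the lexer emits flat terminals, and each tree-building rule becomes a parser function over that token stream.

// Hypothetical terminal tokens, mirroring the bullet list above.
enum Terminal {
    case symbol(String)    // "(", ")", ...
    case keyword(String)   // "SELECT", ...
    case number(String)    // "12.34"
    case comment(String)   // "-- comment eol"
}

// Every rule that builds the syntax tree lives in the parser, not the lexer.
struct Parser {
    let tokens: [Terminal]
    var position = 0

    // <Document> := <Lines>
    mutating func parseDocument() { parseLines() }

    // <Lines> := <Line> [<Lines>]
    mutating func parseLines() {
        parseLine()
        if position < tokens.count { parseLines() }
    }

    // <Line> := ...
    mutating func parseLine() {
        // ... consume the terminals that make up one line ...
        position += 1
    }
}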