
I'm fairly new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:

By default, nearley splits the input into a stream of characters. This is called scannerless parsing. A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
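The tokenization step the docs describe can be sketched by hand. This is a minimal illustration, not nearley's or any real lexer's implementation; the grammar here (integers and `+` only) is assumed for the example:

```javascript
// Sketch of a tokenizer: skip whitespace, combine multi-digit numbers,
// and emit each token as a separate string.
function tokenize(input) {
  const tokens = [];
  // Hypothetical token set: integers and the "+" operator, whitespace skipped.
  const re = /\s*(\d+|\+)/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    tokens.push(m[1]);
  }
  return tokens;
}

console.log(tokenize("512 + 10")); // ["512", "+", "10"]
```

The parser would then consume this token stream instead of raw characters.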

Wouldn't that be the same as:

Math -> Number _ "+" _ Number
Number -> [0-9]:+

I don't see what the purpose of lexers is. It seems that rules are always usable in this case, so there is no need for lexers.

kepe
  • Lexers sometimes make use of regular expressions that determine if you are matching on keywords so you can apply the rules of a language. https://nearley.js.org/docs/tokenizers – Daniel Gale Aug 27 '18 at 19:20
  • But rules can also be used in this case right? – kepe Aug 27 '18 at 19:22
  • 2
    What you are seeing is an example of essentially a pre-built lexer. The benefits are listed on the page and may not matter on a grammar this simple. `…often makes your parser faster by more than an order of magnitude.` `…allows you to write cleaner, more maintainable grammars.` `…helps avoid ambiguous grammars in some cases. For example, a tokenizer can easily tell you that superclass is a single keyword, not a sequence of super and class keywords.` `…gives you lexical information such as line numbers for each token. This lets you make better error messages.` – Daniel Gale Aug 27 '18 at 19:29
  • @DanielGale Could you give an example of a practical use for lexers, and post it as an answer? – kepe Aug 27 '18 at 19:42

1 Answer


After fiddling around with them, I found a use for tokenizers. Say we had the following:

Keyword -> "if" | "else"
Identifier -> [a-zA-Z_]:+

This won't work: if we try compiling it, we get an ambiguous grammar, because "if" will be matched as both a Keyword and an Identifier. A tokenizer, however:

{
  "keyword": /if|else/,
  "identifier": /[a-zA-Z_]+/
}

Trying to compile this will not result in an ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is Moo).
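The way a lexer like Moo resolves this can be sketched by hand: match the identifier pattern first, then reclassify the matched text as a keyword if it is in the keyword set. (This is a hand-rolled illustration, not Moo's actual code; Moo's built-in mechanism for this is `moo.keywords`.)

```javascript
const KEYWORDS = new Set(["if", "else"]);

function tokenize(input) {
  const tokens = [];
  const re = /\s*([a-zA-Z_]+)/g; // match a full identifier-shaped run
  let m;
  while ((m = re.exec(input)) !== null) {
    const text = m[1];
    // Reclassify after matching the whole word: "iffy" stays an
    // identifier, while "if" becomes a keyword. No ambiguity, because
    // each piece of input gets exactly one token type.
    tokens.push({ type: KEYWORDS.has(text) ? "keyword" : "identifier", text });
  }
  return tokens;
}

console.log(tokenize("if foo else iffy"));
```

Because each word is matched as a whole before being classified, the parser never has to consider two interpretations of the same characters.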

kepe