
I'm fairly new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:

By default, nearley splits the input into a stream of characters. This is called scannerless parsing. A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
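The tokenization step the docs describe can be sketched by hand. This is a minimal illustration, not nearley's or any real lexer's implementation; the grammar here (integers and `+` only) is assumed for the example:

```javascript
// Sketch of a tokenizer: skip whitespace, combine multi-digit numbers,
// and emit each token as a separate string.
function tokenize(input) {
  const tokens = [];
  // Hypothetical token set: integers and the "+" operator, whitespace skipped.
  const re = /\s*(\d+|\+)/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    tokens.push(m[1]);
  }
  return tokens;
}

console.log(tokenize("512 + 10")); // ["512", "+", "10"]
```

The parser would then consume this token stream instead of raw characters.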

Wouldn't that be the same as:

Math -> Number _ "+" _ Number
Number -> [0-9]:+

I don't see what the purpose of lexers is. It seems that rules are always usable in this case, so there is no need for lexers.

kepe
  • Lexers sometimes make use of regular expressions that determine if you are matching on keywords so you can apply the rules of a language. https://nearley.js.org/docs/tokenizers – Daniel Gale Aug 27 '18 at 19:20
  • But rules can also be used in this case right? – kepe Aug 27 '18 at 19:22
  • 2
    What you are seeing is an example of essentially a pre-built lexer. The benefits are listed on the page and may not matter on a grammar this simple. `…often makes your parser faster by more than an order of magnitude.` `…allows you to write cleaner, more maintainable grammars.` `…helps avoid ambiguous grammars in some cases. For example, a tokenizer can easily tell you that superclass is a single keyword, not a sequence of super and class keywords.` `…gives you lexical information such as line numbers for each token. This lets you make better error messages.` – Daniel Gale Aug 27 '18 at 19:29
  • @DanielGale Could you give an example of a practical use for lexers, and post it as an answer? – kepe Aug 27 '18 at 19:42

1 Answer


After fiddling around with them, I found a use for tokenizers. Say we had the following:

Keyword -> "if" | "else"
Identifier -> [a-zA-Z_]:+

This won't work: if we try compiling it, we get an ambiguous grammar, because "if" will be matched as both a Keyword and an Identifier. A tokenizer, however:

{
  "keyword": /if|else/,
  "identifier": /[a-zA-Z_]+/
}

Trying to compile this will not result in an ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is Moo).
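The way a lexer like Moo resolves this can be sketched by hand: match the identifier pattern first, then reclassify the matched text as a keyword if it is in the keyword set. (This is a hand-rolled illustration, not Moo's actual code; Moo's built-in mechanism for this is `moo.keywords`.)

```javascript
const KEYWORDS = new Set(["if", "else"]);

function tokenize(input) {
  const tokens = [];
  const re = /\s*([a-zA-Z_]+)/g; // match a full identifier-shaped run
  let m;
  while ((m = re.exec(input)) !== null) {
    const text = m[1];
    // Reclassify after matching the whole word: "iffy" stays an
    // identifier, while "if" becomes a keyword. No ambiguity, because
    // each piece of input gets exactly one token type.
    tokens.push({ type: KEYWORDS.has(text) ? "keyword" : "identifier", text });
  }
  return tokens;
}

console.log(tokenize("if foo else iffy"));
```

Because each word is matched as a whole before being classified, the parser never has to consider two interpretations of the same characters.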

kepe