I am definitely not very good at designing grammars (you probably figured that out), but this triggered my Aha moment:
A lot of people have pointed out to me that writing grammars for nearley is hard. The thing is, writing grammars is, in general, very hard. It doesn’t help that certain grammar-related problems are provably undecidable.
See https://nearley.js.org/docs/how-to-grammar-good
And:
Using a tokenizer has many benefits. It…
- …often makes your parser faster by more than an order of magnitude.
- …allows you to write cleaner, more maintainable grammars.
- …helps avoid ambiguous grammars in some cases. [...]
See https://nearley.js.org/docs/tokenizers
I know that nearley recommends using moo-lexer:
nearley supports and recommends Moo, a super-fast lexer.
See https://nearley.js.org/docs/tokenizers
So I googled around and found this amazing tutorial on YouTube which definitely unblocked me. Thank you so much @airportyh!
At first I thought this was way too complicated for my use case but it turned out that using a lexer actually made things both possible and simpler!
For the sake of simplicity I will provide a solution with a truncated RIS file:
sample.ris
KW - foo
bar
baz
KW - bat
This file should yield ['foo bar baz', 'bat'] after parsing.
First, let's install the two dependencies:
yarn add nearley
yarn add moo
Now let's define our lexer
lexer.js
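const moo = require('moo');

const lexer = moo.compile({
  // lineBreaks lets moo keep its line counter accurate for this token
  NL: { match: /\n/, lineBreaks: true },
  KW: 'KW',
  SEPARATOR: ' - ',
  // lowercase words only, which is enough for the sample file
  CONTENT: /[a-z]+/,
});

module.exports = lexer;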
We have defined four tokens:
- A newline character NL
- The KW keyword ... keyword!
- The SEPARATOR between a tag and its content
- The CONTENT of the tag
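To see what the parser will actually receive, here is a rough standalone sketch that imitates the four rules with plain sticky regular expressions. It is not moo and not how moo works internally; `rules` and `tokenize` are hypothetical names made up for this illustration.

```javascript
// Standalone imitation of the four lexer rules (illustration only, not moo).
const rules = [
  ['NL', /\n/y],
  ['KW', /KW/y],
  ['SEPARATOR', / - /y],
  ['CONTENT', /[a-z]+/y],
];

// Walk the input and emit the first rule that matches at the current offset.
function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, re] of rules) {
      re.lastIndex = pos; // sticky regexes only match exactly at lastIndex
      const m = re.exec(input);
      if (m) {
        tokens.push({ type, value: m[0] });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}

// The sample file, with a trailing newline.
const types = tokenize('KW - foo\nbar\nbaz\nKW - bat\n').map(t => t.type);
console.log(types.join(' '));
// KW SEPARATOR CONTENT NL CONTENT NL CONTENT NL KW SEPARATOR CONTENT NL
```

With moo itself, you should get the same sequence of token types by calling lexer.reset(input) and iterating over the lexer.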
Next let's define our grammar
grammar.ne
@{% const lexer = require('./lexer.js'); %}
@lexer lexer
@builtin "whitespace.ne"
RECORD -> _ KW:+ {% ([, keywords]) => [].concat(...keywords) %}
KW -> %KW %SEPARATOR LINE:+ {% ([,,lines]) => lines.join(' ') %}
LINE -> %CONTENT __ {% ([{value}]) => value %}
Note: see how we can refer to the tokens defined in the lexer by prefixing them with %!
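If the destructuring in the postprocessors looks cryptic, this sketch runs the same three functions by hand on data shaped like what each rule would receive for sample.ris. The `tok` helper and the `null` placeholders standing in for the whitespace matches are assumptions made for the illustration; nearley builds the real arrays for you.

```javascript
// The three postprocessors, copied from the grammar.
const line = ([{ value }]) => value;                     // LINE   -> %CONTENT __
const kw = ([, , lines]) => lines.join(' ');             // KW     -> %KW %SEPARATOR LINE:+
const record = ([, keywords]) => [].concat(...keywords); // RECORD -> _ KW:+

// Hypothetical helper producing token-like objects.
const tok = (type, value) => ({ type, value });

// Each rule receives one array entry per symbol on its right-hand side.
const first = kw([
  tok('KW', 'KW'),
  tok('SEPARATOR', ' - '),
  [
    line([tok('CONTENT', 'foo'), null]),
    line([tok('CONTENT', 'bar'), null]),
    line([tok('CONTENT', 'baz'), null]),
  ],
]);
const second = kw([
  tok('KW', 'KW'),
  tok('SEPARATOR', ' - '),
  [line([tok('CONTENT', 'bat'), null])],
]);

const result = record([null, [first, second]]);
console.log(result);
// [ 'foo bar baz', 'bat' ]
```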
Now we need to compile our grammar
Nearley ships with a compiler:
yarn -s nearleyc grammar.ne > grammar.js
You can also define a compile script in your package.json:
{
  ...
  "scripts": {
    "compile": "nearleyc grammar.ne > grammar.js"
  }
  ...
}
Finally let's build a parser and use it!
parser.js
const nearley = require('nearley');
const grammar = require('./grammar.js');

module.exports = str => {
  const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
  parser.feed(str);
  // results holds every complete parse found; take the first one
  return parser.results[0];
};
Note: this requires the compiled grammar, i.e. grammar.js
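One thing to keep in mind: parser.results is an array. It is empty when the input does not form a complete parse yet, and it has more than one entry when the grammar is ambiguous. A small guard along these lines could replace the bare results[0]; `pickResult` is a hypothetical helper name, not part of nearley.

```javascript
// Hypothetical helper: pick a parse from a nearley-style results array.
function pickResult(results) {
  if (results.length === 0) {
    // Nothing matched the whole input (e.g. it stopped mid-record).
    throw new Error('No complete parse found');
  }
  if (results.length > 1) {
    // More than one parse means the grammar is ambiguous for this input.
    console.warn(`Ambiguous grammar: ${results.length} parses, using the first`);
  }
  return results[0];
}

// Usage with plain arrays standing in for parser.results:
console.log(pickResult([['foo bar baz', 'bat']]));
// [ 'foo bar baz', 'bat' ]
```

In the parser module above, this would be called as pickResult(parser.results).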
Let's throw some text at it:
const parser = require('./parser.js');
parser(`
KW - foo
bar
baz
KW - bat
`);
//=> [ 'foo bar baz', 'bat' ]
Final tip: you can also test your grammar with nearley-test:
cat sample.ris | yarn -s nearley-test -- -q grammar.js