I'm trying to write a grammar for content in the RIS format with nearley.

Example of file:

TY  - JOUR
KW  - foo
KW  - bar
ER  - 

A *.ris file always starts with the tag TY and ends with the tag ER. In between there can be many other tags like KW (keyword).

The spec says that a single KW statement can span multiple lines.

So this:

TY  - JOUR
KW  - foo
bar
baz
KW  - bat
ER  - 

Is equivalent to:

TY  - JOUR
KW  - foo bar baz
KW  - bat
ER  - 

I'm struggling to come up with a grammar that says something like:

A keyword starts with KW followed by - followed by either:

  • letters until the end of the line
  • letters until the end of the line and any other lines until the next keyword

Whatever I try ends up "swallowing" all other statements, e.g. the first multi-line keyword captures everything else after it.

How would you write this rule? I'm not necessarily interested in a specific answer. Anything that triggers my "Aha" moment will do!

customcommander

1 Answer

I am definitely not very good at designing grammars (you probably figured that out) but this triggered my Aha moment:

A lot of people have pointed out to me that writing grammars for nearley is hard. The thing is, writing grammars is, in general, very hard. It doesn’t help that certain grammar-related problems are provably undecidable.

See https://nearley.js.org/docs/how-to-grammar-good

And:

Using a tokenizer has many benefits. It…

  • …often makes your parser faster by more than an order of magnitude.
  • …allows you to write cleaner, more maintainable grammars.
  • …helps avoid ambiguous grammars in some cases. [...]

See https://nearley.js.org/docs/tokenizers

nearley also recommends using Moo as the tokenizer:

nearley supports and recommends Moo, a super-fast lexer.

See https://nearley.js.org/docs/tokenizers

So I googled around and found this amazing tutorial on YouTube which definitely unblocked me. Thank you so much @airportyh!

At first I thought this was way too complicated for my use case but it turned out that using a lexer actually made things both possible and simpler!


For the sake of simplicity I will provide a solution with a truncated RIS file:

sample.ris

KW  - foo
bar
baz
KW  - bat

This file should yield ['foo bar baz', 'bat'] after parsing.
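
For comparison, here is a hand-rolled sketch of the behaviour we are after, written without nearley. It is not the solution described below, only a reference for the expected output. The `TAG_LINE` pattern and the `foldKeywords` name are assumptions made for illustration: a tag line is taken to be two characters followed by two spaces and a hyphen, and any other non-empty line is treated as a continuation of the previous KW statement.

```javascript
// Hand-rolled reference (not the nearley solution): fold continuation
// lines into the KW statement that precedes them.
// Assumption: a tag line is two characters, two spaces, a hyphen, then the value.
const TAG_LINE = /^([A-Z][A-Z0-9])  - ?(.*)$/;

function foldKeywords(text) {
  const keywords = [];
  let current = null; // parts of the KW statement being built, or null

  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '') continue;

    const match = TAG_LINE.exec(line);
    if (match) {
      const [, tag, value] = match;
      if (tag === 'KW') {
        // A new KW statement starts here.
        current = value ? [value] : [];
        keywords.push(current);
      } else {
        // Any other tag ends the current KW statement.
        current = null;
      }
    } else if (current) {
      // No tag on this line: it continues the current KW statement.
      current.push(line);
    }
  }

  return keywords.map(parts => parts.join(' '));
}

console.log(foldKeywords('KW  - foo\nbar\nbaz\nKW  - bat\n'));
// => [ 'foo bar baz', 'bat' ]
```

The key point is that a continuation line is recognised by what it is not (it does not start with a tag), which is the same distinction the lexer makes below.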

First, let's install nearley and Moo:

yarn add nearley
yarn add moo

Now let's define our lexer

lexer.js

const moo = require('moo');

const lexer =
  moo.compile
    ( { NL: {match: /[\n]/, lineBreaks: true}
      , KW: 'KW'
      , SEPARATOR: "  - "
      , CONTENT: /[a-z]+/
      }
    );

module.exports = lexer;

We have defined four tokens:

  1. A newline character NL
  2. The KW keyword ... keyword!
  3. The SEPARATOR between a tag and its content
  4. The CONTENT of the tag
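
To get a feel for what these four rules produce without installing anything, here is a dependency-free sketch that mimics them with sticky regular expressions. This is not how Moo is implemented, and the `RULES` table and `tokenize` function are hypothetical names used only for illustration.

```javascript
// Dependency-free sketch of the four lexer rules. NOT Moo's implementation;
// it only mimics which token types the rules would produce.
const RULES = [
  ['NL', /\n/y],
  ['KW', /KW/y],
  ['SEPARATOR', /  - /y],
  ['CONTENT', /[a-z]+/y],
];

function tokenize(input) {
  const tokens = [];
  let offset = 0;

  while (offset < input.length) {
    let matched = false;
    for (const [type, pattern] of RULES) {
      pattern.lastIndex = offset; // sticky: only match exactly at `offset`
      const match = pattern.exec(input);
      if (match) {
        tokens.push({ type, value: match[0] });
        offset += match[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(
        `Unexpected character at offset ${offset}: ${JSON.stringify(input[offset])}`
      );
    }
  }

  return tokens;
}

const types = tokenize('KW  - foo\nbar\nKW  - bat\n').map(t => t.type);
console.log(types.join(' '));
// KW SEPARATOR CONTENT NL CONTENT NL KW SEPARATOR CONTENT NL
```

Once the input is a flat stream of typed tokens, a continuation line is just a CONTENT token that is not preceded by KW and SEPARATOR, so the grammar no longer has to guess where a keyword ends. Note that these rules only cover the truncated sample: a line such as `TY  - JOUR` would be rejected because CONTENT only matches lowercase letters.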

Next let's define our grammar

grammar.ne

@{% const lexer = require('./lexer.js'); %}
@lexer lexer
@builtin "whitespace.ne"

RECORD -> _ KW:+                {% ([, keywords]) => [].concat(...keywords) %}
KW     -> %KW %SEPARATOR LINE:+ {% ([,,lines])    => lines.join(' ')        %}
LINE   -> %CONTENT __           {% ([{value}])    => value                  %}

Note: tokens defined in the lexer are referenced in the grammar by prefixing their names with %, e.g. %KW.
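
If the postprocessors (the functions between {% and %}) look cryptic, it may help to run them by hand. The sketch below extracts the three functions from the grammar and feeds them mocked data. The `token` helper is a hypothetical stand-in for a lexer token (only its `value` property matters here), and it assumes that `_` and `__` from whitespace.ne yield null.

```javascript
// The three postprocessors from grammar.ne, extracted as plain functions.
const line   = ([{ value }]) => value;                    // LINE   -> %CONTENT __
const kw     = ([, , lines]) => lines.join(' ');          // KW     -> %KW %SEPARATOR LINE:+
const record = ([, keywords]) => [].concat(...keywords);  // RECORD -> _ KW:+

// Hypothetical stand-in for a lexer token; only `value` is used.
const token = value => ({ value });
// Assumption: the whitespace rules `_` and `__` yield null.
const WS = null;

// KW  - foo / bar / baz
const first = kw([
  token('KW'),
  token('  - '),
  [line([token('foo'), WS]), line([token('bar'), WS]), line([token('baz'), WS])],
]);

// KW  - bat
const second = kw([token('KW'), token('  - '), [line([token('bat'), WS])]]);

console.log(record([WS, [first, second]]));
// => [ 'foo bar baz', 'bat' ]
```

Each postprocessor receives one array entry per symbol on the right-hand side of its rule, which is why the destructuring patterns skip the positions they do not need.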

Now we need to compile our grammar

Nearley ships with a compiler:

yarn -s nearleyc grammar.ne > grammar.js

You can also define a compile script in your package.json:

{

  ...

  "scripts": {
    "compile": "nearleyc grammar.ne > grammar.js"
  }

  ...

}

Finally, let's build a parser and use it!

parser.js

const nearley = require('nearley');
const grammar = require('./grammar.js');

module.exports =
  str => {
    const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
    parser.feed(str);
    return parser.results[0];
  };

Note: this requires the compiled grammar, i.e. grammar.js, not grammar.ne.

Let's throw some text at it:

const parser = require('./parser.js');

parser(`
KW  - foo
bar
baz
KW  - bat
`);
//=> [ 'foo bar baz', 'bat' ]
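
One caveat about returning `parser.results[0]`: as far as I understand the nearley docs, `results` is an array of all possible parsings, so it can be empty when the input is incomplete and can hold more than one entry when the grammar is ambiguous. A small guard makes both cases visible. The `singleResult` helper below is hypothetical (it is not part of nearley) and only inspects a plain array.

```javascript
// Hypothetical helper (not part of nearley): return the only parse,
// or fail loudly when there is none or more than one.
function singleResult(results) {
  if (results.length === 0) {
    throw new Error('No parse found: the input is probably incomplete');
  }
  if (results.length > 1) {
    throw new Error(`Ambiguous grammar: ${results.length} parses found`);
  }
  return results[0];
}

// Same shape as `parser.results` after a successful, unambiguous parse.
console.log(singleResult([['foo bar baz', 'bat']]));
// => [ 'foo bar baz', 'bat' ]
```

In parser.js this would replace the last line of the function with `return singleResult(parser.results);`.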

Final tip: you can also test your grammar with nearley-test:

cat sample.ris | yarn -s nearley-test -- -q grammar.js
customcommander