This is my first time writing a parser using a grammar and a parser generator. I want to parse an ASN.1-like format using the lark Python module.

Here is an example of the data I'm trying to parse:

text = """
start_thing {
  literal {
    length 100,
    fuzz lim unk,
    seq-data gap {
      type fragment,
      linkage linked,
      linkage-evidence {
        {
          type unspecified
        }
      }
    }
  },
  loc int {
    from 0,
    to 1093,
    strand plus,
    id gi 384632836
  }
}
"""

The structure can contain all sorts of nodes, and I can't know in advance exactly what tags or combination of tags I should expect. However, there are some structures I want to be able to parse, like the "loc int {...}" part.
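
For reference, here is what I want to get out of a "loc int {...}" block, written as a minimal sketch with a plain regular expression instead of a grammar. It assumes the block always has exactly the four fields shown in the sample, in that order; the names LOCUS_RE and find_loci are hypothetical and only serve this illustration.

```python
import re

# Sample input reduced to the part of interest (taken from the example above).
text = """
start_thing {
  loc int {
    from 0,
    to 1093,
    strand plus,
    id gi 384632836
  }
}
"""

# Hypothetical pattern: arbitrary whitespace is allowed between tokens,
# and the four fields are assumed to appear in exactly this order.
LOCUS_RE = re.compile(
    r"loc\s+int\s*\{\s*"
    r"from\s+(?P<start>\d+)\s*,\s*"
    r"to\s+(?P<end>\d+)\s*,\s*"
    r"strand\s+(?P<strand>plus|minus)\s*,\s*"
    r"id\s+gi\s+(?P<gi>\d+)\s*\}"
)


def find_loci(data):
    """Return one dict per `loc int {...}` block found in `data`."""
    loci = []
    for match in LOCUS_RE.finditer(data):
        loci.append({
            "from": int(match.group("start")),
            "to": int(match.group("end")),
            "strand": match.group("strand"),
            "gi": int(match.group("gi")),
        })
    return loci


loci = find_loci(text)
print(loci)
```

This ignores the surrounding nesting entirely, so it is not a replacement for the grammar, only a description of the target result.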

Here is the grammar I tried, where I used numbers to define priorities:

grammar = """\
thing: "start_thing" node
strand_info.5: "strand plus"
    | "strand minus"
locus_info.4: "loc int" "{" "from" INT "," "to" INT "," strand_info "," "id gi" INT "}"
nodes.1: node?
    | node ("," node)*
node.1: locus_info
    | TAGS? INT           -> intinfo
    | TAGS? "{" nodes "}" -> subnodes
    | TAGS                -> onlytags
TAGS.2: TAGWORD (WS TAGWORD)*
TAGWORD.3: ("_"|LETTER)("_"|"-"|LETTER|DIGIT)*
%import common.WS
%import common.LETTER
%import common.DIGIT
%import common.INT
%ignore WS
"""

I thought the priorities (the numbers appended to the rule and terminal names) would be enough for the "loc int" block to be recognized in preference to the more general node kinds. However, when I build a parser from the above grammar and run it on the text above, this part is parsed as a subnodes instead of a locus_info:

from lark import Lark

parser = Lark(grammar, start="thing", ambiguity="explicit")
parsed = parser.parse(text)
print(parsed.pretty())

I obtain the following:

thing
  subnodes
    nodes
      subnodes
        literal
        nodes
          intinfo
            length
            100
          onlytags  fuzz lim unk
          subnodes
            seq-data gap
            nodes
              onlytags  type fragment
              onlytags  linkage linked
              subnodes
                linkage-evidence
                nodes
                  subnodes
                    nodes
                      onlytags  type unspecified
      subnodes
        loc int
        nodes
          intinfo
            from
            0
          intinfo
            to
            1093
          onlytags  strand plus
          intinfo
            id gi
            384632836

What am I doing wrong?

Note: I've seen a related question (Priority in grammar using Lark) but I do not see how to apply its answers to my problem. I don't think I am in a situation where I can fully disambiguate my grammar (there are too many possible cases in the real data), and I didn't understand what the ambiguity="explicit" option is supposed to do.


Edit: inverting priorities

I tried inverting priorities, as follows:

grammar = """\
thing: "start_thing" node
strand_info.1: "strand plus"
    | "strand minus"
locus_info.2: "loc int" "{" "from" INT "," "to" INT "," strand_info "," "id gi" INT "}"
nodes.5: node?
    | node ("," node)*
node.5: locus_info
    | TAGS? INT           -> intinfo
    | TAGS? "{" nodes "}" -> subnodes
    | TAGS                -> onlytags
TAGS.4: TAGWORD (WS TAGWORD)*
TAGWORD.3: ("_"|LETTER)("_"|"-"|LETTER|DIGIT)*
%import common.WS
%import common.LETTER
%import common.DIGIT
%import common.INT
%ignore WS
"""
parser = Lark(grammar, start="thing", ambiguity="explicit")
parsed = parser.parse(text)
print(parsed.pretty())

However, the output is exactly the same. It is as if those priorities were ignored, or as if there were actually no ambiguity because my locus_info rule is not correctly specified.
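
A possible workaround while the grammar question is open: keep the generic parse and recognize the "loc int" subtree afterwards. The sketch below assumes the tree has the shape printed above (a subnodes node whose first child is the tag text "loc int" and whose second child is a nodes node). Node is a hypothetical stand-in for lark's Tree, which exposes the same .data and .children attributes; leaves are plain strings here, and as_locus is a made-up helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for lark's Tree (same .data / .children attributes).
# Leaves are plain strings, standing in for lark tokens.
Node = namedtuple("Node", ["data", "children"])


def as_locus(node):
    """Return a dict if `node` looks like the `loc int {...}` subtree, else None."""
    if not isinstance(node, Node) or node.data != "subnodes":
        return None
    if len(node.children) != 2:
        return None
    tags, body = node.children
    # The tag text may contain arbitrary whitespace between the two words.
    if not isinstance(tags, str) or tags.split() != ["loc", "int"]:
        return None
    if not isinstance(body, Node) or body.data != "nodes":
        return None
    fields = {}
    for child in body.children:
        if not isinstance(child, Node):
            return None
        if child.data == "intinfo" and len(child.children) == 2:
            key, value = child.children
            fields[" ".join(key.split())] = int(value)
        elif child.data == "onlytags" and len(child.children) == 1:
            words = child.children[0].split()
            if len(words) != 2:
                return None
            fields[words[0]] = words[1]
        else:
            return None
    # Only accept the exact set of fields seen in the sample data.
    if set(fields) != {"from", "to", "strand", "id gi"}:
        return None
    return fields


# Hand-built copy of the relevant part of the printed tree.
locus_tree = Node("subnodes", [
    "loc int",
    Node("nodes", [
        Node("intinfo", ["from", "0"]),
        Node("intinfo", ["to", "1093"]),
        Node("onlytags", ["strand plus"]),
        Node("intinfo", ["id gi", "384632836"]),
    ]),
])
other_tree = Node("subnodes", [
    "literal",
    Node("nodes", [Node("intinfo", ["length", "100"])]),
])

locus = as_locus(locus_tree)
not_locus = as_locus(other_tree)
print(locus)
print(not_locus)
```

With a real lark tree, the isinstance checks would presumably test against lark.Tree instead of Node. This does not explain why locus_info is never chosen; it only sidesteps the problem.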

bli
  • If I'm not mistaken, ASN.1 is a non-ambiguous grammar, so I don't think using priorities is the right call here. Take a look at this [one](https://github.com/richb-hanover/mibble-2.9.2/blob/master/src/grammar/asn1.grammar); you could try to convert that one to lark syntax. A more interesting question would be: when is it strictly necessary to use priorities in a lark grammar? I'd say if you can avoid using priorities, you should. – BPL Jun 19 '18 at 13:34
  • Actually, I suspect the format is not "real" asn.1, but just something inspired by asn.1. – bli Jun 19 '18 at 13:41
  • Priority is only relevant as a way to choose between different possible parse trees (aka derivations). It seems like your input can only be parsed in one way, or ambiguity='explicit' would return all the different derivations. So priority won't affect the result. – Erez Jun 19 '18 at 14:15
  • @Erez I suppose my `locus_info` rule is wrong, then. How should I write it so that it matches the `loc int {...}` node? – bli Jun 22 '18 at 11:01

1 Answer

I think you should change your priorities. The "locus_info.4" is the most precise rule so it has to be first in priority.

Dryslope
  • I thought that the higher the number, the higher the priority, based on the example given in the documentation: https://github.com/lark-parser/lark/wiki/Grammar-Reference `DECIMAL.2: INTEGER "." INTEGER //# Will be matched before INTEGER` – bli Jun 21 '18 at 09:51
  • Hmm, yes, but there are no other priorities in this example. And in the official documentation of lark, it's written "rule.n: ... Rule with priority n" (https://github.com/lark-parser/lark/blob/master/docs/lark_cheatsheet.pdf), so to me it has to be read as "rule.1: ..." having the first priority. Maybe I'm interpreting it wrongly. Did you try to change it? I didn't manage to test your code to see if it works. – Dryslope Jun 21 '18 at 10:07
  • I think I tried, and it seemed to have no effect. But I will try again the next time I get back to this work and report the results of my experiments in the question. – bli Jun 21 '18 at 10:18
  • Ok, and if you try to do something like "locus_info: LOC "{" ..." and "LOC: "loc int" " ? – Dryslope Jun 22 '18 at 15:17
  • This does not seem to have any effect. This seems to be only parsed as a `subnodes`, not as a `locus_info`. – bli Jun 22 '18 at 15:22