This is my first time writing a parser using a grammar and a parser generator. I want to parse some kind of ASN.1 format using the Lark Python module.
Here is an example of the data I'm trying to parse:
text = """
start_thing {
  literal {
    length 100,
    fuzz lim unk,
    seq-data gap {
      type fragment,
      linkage linked,
      linkage-evidence {
        {
          type unspecified
        }
      }
    }
  },
  loc int {
    from 0,
    to 1093,
    strand plus,
    id gi 384632836
  }
}
"""
The structure can contain all sorts of nodes, and I can't know in advance exactly what tags or combination of tags I should expect. However, there are some structures I want to be able to parse, like the "loc int {...}" part.
Here is the grammar I tried, where I used numbers to define priorities:
grammar = """\
thing: "start_thing" node
strand_info.5: "strand plus"
| "strand minus"
locus_info.4: "loc int" "{" "from" INT "," "to" INT "," strand_info "," "id gi" INT "}"
nodes.1: node?
| node ("," node)*
node.1: locus_info
| TAGS? INT -> intinfo
| TAGS? "{" nodes "}" -> subnodes
| TAGS -> onlytags
TAGS.2: TAGWORD (WS TAGWORD)*
TAGWORD.3: ("_"|LETTER)("_"|"-"|LETTER|DIGIT)*
%import common.WS
%import common.LETTER
%import common.DIGIT
%import common.INT
%ignore WS
"""
I thought the priorities (the numbers appended to the rule names) would be enough for the "loc int" block to be recognized ahead of the more general node alternatives, but this part is parsed as a subnodes instead of as a locus_info when I build a parser from the above grammar and run it on the text above:
from lark import Lark

parser = Lark(grammar, start="thing", ambiguity="explicit")
parsed = parser.parse(text)
print(parsed.pretty())
I obtain the following:
thing
  subnodes
    nodes
      subnodes
        literal
        nodes
          intinfo
            length
            100
          onlytags fuzz lim unk
          subnodes
            seq-data gap
            nodes
              onlytags type fragment
              onlytags linkage linked
              subnodes
                linkage-evidence
                nodes
                  subnodes
                    nodes
                      onlytags type unspecified
      subnodes
        loc int
        nodes
          intinfo
            from
            0
          intinfo
            to
            1093
          onlytags strand plus
          intinfo
            id gi
            384632836
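To make the overlap I am worried about concrete, here is a rough check that does not involve Lark at all. TAGS_RE is my own hand translation of the TAGS terminal into a Python regex (an approximation based on the definitions of LETTER, DIGIT and WS in Lark's common grammar, not what Lark generates internally), and the keywords list is just the string literals used in locus_info and strand_info. Every one of those keywords is also a complete match for TAGS, which is why I expected priorities to be needed in the first place:

```python
import re

# Hand-written approximation of the TAGWORD and TAGS terminals from the grammar.
TAGWORD = r"[_A-Za-z][_\-A-Za-z0-9]*"
TAGS_RE = re.compile(rf"{TAGWORD}(?:[ \t\f\r\n]+{TAGWORD})*")

# String literals that appear in the locus_info and strand_info rules.
keywords = ["loc int", "from", "to", "strand plus", "strand minus", "id gi"]

# Keep the keywords that the general TAGS pattern matches in full.
overlapping = [kw for kw in keywords if TAGS_RE.fullmatch(kw)]
print(overlapping)
```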
What am I doing wrong?
Note: I've seen a related question (Priority in grammar using Lark), but I don't see how to apply its answers to my problem. I don't think I am in a case where I can fully disambiguate my grammar (too many possible cases in the real data), and I didn't understand what the ambiguity="explicit" option is supposed to do.
Edit: inverting priorities
I tried inverting priorities, as follows:
grammar = """\
thing: "start_thing" node
strand_info.1: "strand plus"
| "strand minus"
locus_info.2: "loc int" "{" "from" INT "," "to" INT "," strand_info "," "id gi" INT "}"
nodes.5: node?
| node ("," node)*
node.5: locus_info
| TAGS? INT -> intinfo
| TAGS? "{" nodes "}" -> subnodes
| TAGS -> onlytags
TAGS.4: TAGWORD (WS TAGWORD)*
TAGWORD.3: ("_"|LETTER)("_"|"-"|LETTER|DIGIT)*
%import common.WS
%import common.LETTER
%import common.DIGIT
%import common.INT
%ignore WS
"""
parser = Lark(grammar, start="thing", ambiguity="explicit")
parsed = parser.parse(text)
print(parsed.pretty())
However, the output is exactly the same. It is as if the priorities were ignored, or as if there were actually no ambiguity because my locus_info rule is not correctly specified.