I'm trying to parse a particular syntax for positions in biological sequences. The positions can have forms like:
12 -- a simple position in the sequence
12+34 -- a complex position, given as a base (12) and an offset (+34)
12_56 -- a range, from 12 to 56
12+34_56-78 -- a range from start to end, where either or both ends may be simple or complex
I'd like to have these parsed as dicts, roughly like this:
12 -> { 'start': { 'base': 12, 'offset': 0 }, 'end': None }
12+34 -> { 'start': { 'base': 12, 'offset': 34 }, 'end': None }
12_56 -> { 'start': { 'base': 12, 'offset': 0 },
           'end': { 'base': 56, 'offset': 0 } }
12+34_56-78 -> { 'start': { 'base': 12, 'offset': 34 },
                 'end': { 'base': 56, 'offset': -78 } }
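To pin down the target output, here is a minimal sketch of the mapping written with a plain regular expression rather than pyparsing. The names POSITION_RE and parse_interval are hypothetical, and the sketch assumes an offset always carries an explicit sign and that a missing offset should become 0.

```python
import re

# Hypothetical pattern: a base, an optional signed offset, and an optional
# "_"-separated end position of the same shape.
POSITION_RE = re.compile(
    r'(?P<start_base>\d+)(?P<start_offset>[+-]\d+)?'
    r'(?:_(?P<end_base>\d+)(?P<end_offset>[+-]\d+)?)?'
)

def parse_interval(text):
    """Return the nested dict described above, or raise ValueError."""
    m = POSITION_RE.fullmatch(text)
    if m is None:
        raise ValueError('not a valid position: %r' % text)

    def position(prefix):
        base = m.group(prefix + '_base')
        if base is None:
            # No end position was given, so the whole entry is None.
            return None
        # A missing offset is normalised to 0; int() accepts '+34' and '-78'.
        offset = int(m.group(prefix + '_offset') or 0)
        return {'base': int(base), 'offset': offset}

    return {'start': position('start'), 'end': position('end')}

print(parse_interval('12'))
print(parse_interval('12+34_56-78'))
```

This is only meant to show the shape of the result I'm after; I would still prefer to get there with a pyparsing grammar.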
I've made several stabs using pyparsing. Here's one:
from pyparsing import *
integer = Word(nums)
signed_integer = Word('+-', nums)
underscore = Suppress('_')
position = integer.setResultsName('base') + Or(signed_integer,Empty).setResultsName('offset')
interval = position.setResultsName('start') + Or(underscore + position,Empty).setResultsName('end')
The results are close to what I want:
In [20]: hgvspyparsing.interval.parseString('12-34_56+78').asDict()
Out[20]:
{'base': '56',
'end': (['56', '+78'], {'base': [('56', 0)], 'offset': [((['+78'], {}), 1)]}),
'offset': (['+78'], {}),
'start': (['12', '-34'], {'base': [('12', 0)], 'offset': [((['-34'], {}), 1)]})}
Two questions:
1. asDict() only converts the root ParseResults; the nested values are still ParseResults objects. Is there a way to cajole pyparsing into returning a nested dict (and only that)?
2. How do I express that the end of a range and the offset of a position are optional? The Or() in the position rule doesn't cut it. (I tried the same approach for the end of the range.) Ideally, I'd treat all positions as special cases of the most complex form (i.e., { start: { base, offset }, end: { base, offset } }), where the simpler cases use 0 or None.
Thanks!