I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because there's a bit of conflict trying to parse these two constructs:

Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}

The way I'm parsing the image link is as follows (note that this converts perfectly fine):

from pyparsing import ParseFatalException, QuotedString

def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s,l,"invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)

image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])

However, when I try to run the following test case:

text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")

I get the following stack trace:

Traceback (most recent call last):
  File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
    print text.transformString("{{{ //ignore formatting// }}}")
  File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
    raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting//  (at char 0), (line:1, col:1)

From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?

lanour
  • I haven't used pyparsing, but a quick look suggests Regex("{{[^{].*}}") should work for image, i.e. define image as two {'s, followed by anything other than a {, followed by anything, followed by two }'s. – Foon Apr 15 '15 at 03:08

1 Answer

The issue is with this expression:

text = italic | bold | hr | newline | image | n

Pyparsing works strictly left-to-right, with no lookahead. The '|' operator builds a pyparsing MatchFirst expression, which takes the first alternative that matches, even if a later alternative would be a better match.

You can change the evaluation to use "longest match" by using the '^' operator instead:

text = italic ^ bold ^ hr ^ newline ^ image ^ n

This carries a performance penalty, because every alternative is tested even when there is no possibility of a better match.
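
To see the difference between the two operators in isolation, here is a minimal sketch that uses only the two QuotedString definitions from the question, without their parse actions. The names sample, first_match and longest_match are made up for illustration, and the camelCase method names follow the question's code (newer pyparsing releases also offer snake_case equivalents).

```python
from pyparsing import QuotedString

# Bare versions of the two expressions from the question (no parse actions).
image = QuotedString("{{", endQuoteChar="}}")
n = QuotedString("{{{", endQuoteChar="}}}")

sample = "{{{text}}}"  # hypothetical test input

# MatchFirst: image is listed first and matches, so n is never tried.
first_match = (image | n).parseString(sample)[0]

# Or: both alternatives are tried and the longer match (n) is kept.
longest_match = (image ^ n).parseString(sample)[0]

print(repr(first_match))    # '{text'
print(repr(longest_match))  # 'text'
```

With '|' the image expression consumes the first two braces and stops at the first }}, which is the same mis-parse that shows up in the traceback.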

An easier solution is to reorder the alternatives so that n is tested before image:

text = italic | bold | hr | newline | n | image

Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
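
As a rough check of the reordering, the sketch below rebuilds just the two conflicting expressions from the question and runs both sample inputs through transformString. It leaves out italic, bold, hr and newline, since their definitions are not shown in the question; nowiki_html and image_html are illustrative names only.

```python
from pyparsing import ParseFatalException, QuotedString

def parse_image(s, l, t):
    # Same parse action as in the question: expects "link|title".
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s, l, "invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)

image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda t: t[0])

# n comes first, so "{{{" is tested before the shorter "{{".
text = n | image

nowiki_html = text.transformString("{{{ //ignore formatting// }}}")
image_html = text.transformString("{{image.jpg|title}}")

print(repr(nowiki_html))  # ' //ignore formatting// '
print(image_html)         # <img src="image.jpg" alt="title" />
```

Image links still convert, because n fails immediately on an input that starts with only two braces and the parser falls through to image.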

This often crops up when people define numeric terms, and accidentally define something like:

integer = Word(nums)
realnumber = Combine(Word(nums) + '.' + Word(nums))
number = integer | realnumber

In this case, number will never match a realnumber, since the leading whole-number part will be parsed as an integer. The fix, as in your case, is either to use the '^' operator or just to reorder:

number = realnumber | integer
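
A small sketch of the numeric case, showing what each variant returns for the same input. The names wrong_order, reordered and longest are made up for this example.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
realnumber = Combine(Word(nums) + "." + Word(nums))

wrong_order = integer | realnumber  # integer always wins on the leading digits
reordered = realnumber | integer    # try the more specific form first
longest = integer ^ realnumber      # test both, keep the longer match

print(wrong_order.parseString("3.14")[0])  # 3
print(reordered.parseString("3.14")[0])    # 3.14
print(longest.parseString("3.14")[0])      # 3.14
print(reordered.parseString("42")[0])      # 42
```

The reordered version still accepts plain integers: realnumber fails when no '.' follows the digits, and the parser then falls back to integer.
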
PaulMcG
  • Thanks a ton! If I evaluate based on longest match, will it take significantly long to convert for large documents? Or is the performance penalty negligible for something as trivial as a markup converter? – lanour Apr 15 '15 at 03:35
  • Wiki markups are surprisingly difficult parsers. They have to handle nesting of markup attributes, often have overloaded symbols, and sometimes are indentation sensitive (especially difficult to implement with pyparsing). For instance, I see that Creole uses `'**'` for indicating bold *and* for subitems in a bulleted list - resolving these could be tricky. – PaulMcG Apr 15 '15 at 11:12
  • I won't try to predict how the performance will go - it is too dependent on the rest of your parser, and on the wiki input too. Now that you know the options, you can try them for yourself. – PaulMcG Apr 15 '15 at 11:15