Preserve newlines in nestedExpr

Question

Is it possible for nestedExpr to preserve newlines?

Here is a simple example:

import pyparsing as pp

# Parse expressions like: \name{body}
name = pp.Word( pp.alphas )
body = pp.nestedExpr( '{', '}' )
expr = '\\' + name('name') + body('body')

# Example text to parse
txt = r'''
This \works{fine}, but \it{
    does not
    preserve newlines
}
'''

# Show results
for e in expr.searchString(txt):
    print('name: ' + e.name)
    print('body: ' + str(e.body) + '\n')

Output:

name: works
body: [['fine']]

name: it
body: [['does', 'not', 'preserve', 'newlines']]

As you can see, the body of the second expression (\it{ ...) is parsed despite the newlines in the body, but I would have expected the result to store each line in a separate subarray. This result makes it impossible to distinguish body contents with single vs. multiple lines.

PaulMcG · Accepted Answer · 2017-04-17T01:30:44.607

3

I didn't get to look at your answer until just a few minutes ago, and I had already come up with this approach:

body = pp.nestedExpr( '{', '}', content = (pp.LineEnd() | name.setWhitespaceChars(' ')))

Changing body to this definition gives these results:

name: works
body: [['fine']]

name: it
body: [['\n', 'does', 'not', '\n', 'preserve', 'newlines', '\n']]
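If you stay with this first definition, the '\n' tokens are enough to rebuild the line structure after parsing. A minimal sketch in plain Python, assuming the body tokens look like the list above; split_lines is a hypothetical helper (not part of pyparsing), and it silently drops empty lines:

```python
def split_lines(tokens):
    # Group a flat token list into one sublist per line,
    # treating each '\n' token as a line separator.
    lines, current = [], []
    for tok in tokens:
        if tok == '\n':
            if current:          # skip empty lines
                lines.append(current)
            current = []
        else:
            current.append(tok)
    if current:                  # last line without a trailing '\n'
        lines.append(current)
    return lines

# Token list copied from the output above
body = ['\n', 'does', 'not', '\n', 'preserve', 'newlines', '\n']
lines = split_lines(body)
print(lines)
```

A single-line body such as ['fine'] comes back as one line, so the single-line and multi-line cases stay distinguishable.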

EDIT:

Wait, if what you want are the separate lines, then perhaps this is more what you are looking for:

single_line = pp.OneOrMore(name.setWhitespaceChars(' ')).setParseAction(' '.join)
multi_line = pp.OneOrMore(pp.Optional(single_line) + pp.LineEnd().suppress())
body = pp.nestedExpr( '{', '}', content = multi_line | single_line )

Which gives:

name: works
body: [['fine']]

name: it
body: [['does not', 'preserve newlines']]

edited Apr 17 '17 at 01:30

answered Apr 17 '17 at 01:16

PaulMcG


I don't think it gets better than an answer from the author of the package himself! :) Sorry if my suggestion was a bit clumsy, but can I just ask in this one; why do you use `name` in the definition of `body`? I admit it's not entirely clear from my question, but what I am really after are the _raw_ contents between the brackets, ideally untouched by any parsing rule or tokeniser, so I can parse them separately later on (possibly then with different parsing rules, depending on the contents of the parent). – Jonathan H Apr 17 '17 at 07:53

To match *anything*, in place of `name` you'd probably use something like `pp.Word(pp.printables, excludeChars="{}")`. You may also have to fiddle with wrapping with `pp.originalTextFor` to get the raw string contents. Welcome to pyparsing! – PaulMcG Apr 17 '17 at 12:06
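One way to read the originalTextFor suggestion, as a sketch rather than the commenter's exact code: wrap the default nestedExpr so that the whole match comes back as one untouched string, then slice off the outer braces. The names raw_body and results are illustrative assumptions, and the default nestedExpr content is assumed to accept everything between the braces.

```python
import pyparsing as pp

name = pp.Word(pp.alphas)

# originalTextFor returns the matched slice of the input string
# instead of the nested token lists built by nestedExpr.
raw_body = pp.originalTextFor(pp.nestedExpr('{', '}'))
expr = '\\' + name('name') + raw_body('body')

txt = r'''
This \works{fine}, but \it{
    does not
    preserve newlines
}
'''

# The raw text still includes the outer braces, so strip them with [1:-1].
results = [(e['name'], e['body'][1:-1]) for e in expr.searchString(txt)]

for found_name, found_body in results:
    print('name: ' + found_name)
    print('body: ' + repr(found_body) + '\n')
```

Each body is then an ordinary string, newlines and indentation included, which can be handed to a second parser chosen according to the name.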

score 0 · Answer 2 · edited May 23 '17 at 11:54

This extension (based on the code of nestedExpr in pyparsing 2.1.10) behaves more like what I would expect a "nested expression" to return:

import string
from pyparsing import *

defaultWhitechars = string.whitespace
ParserElement.setDefaultWhitespaceChars(defaultWhitechars)

def fencedExpr( opener="(", closer=")", content=None, ignoreExpr=None, stripchars=defaultWhitechars ):

    if content is None:
        if isinstance(opener,basestring) and isinstance(closer,basestring):
            if len(opener) == 1 and len(closer)==1:
                if ignoreExpr is not None:
                    content = Combine(OneOrMore( ~ignoreExpr + CharsNotIn(opener+closer,exact=1)))
                else:
                    content = empty.copy() + CharsNotIn(opener+closer)
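            else:
                # Each iteration must consume one character: negative lookaheads alone
                # match zero characters, so OneOrMore would never advance.
                if ignoreExpr is not None:
                    content = Combine(OneOrMore( ~ignoreExpr + ~Literal(opener) + ~Literal(closer) + CharsNotIn('',exact=1)))
                else:
                    content = Combine(OneOrMore( ~Literal(opener) + ~Literal(closer) + CharsNotIn('',exact=1)))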
        else:
            raise ValueError("opening and closing arguments must be strings if no content expression is given")

    if stripchars is not None:
        content.setParseAction(lambda t:t[0].strip(stripchars))

    ret = Forward()
    if ignoreExpr is not None:
        ret <<= Group( Suppress(opener) + ZeroOrMore( ignoreExpr | ret | content ) + Suppress(closer) )
    else:
        ret <<= Group( Suppress(opener) + ZeroOrMore( ret | content )  + Suppress(closer) )
    ret.setName('nested %s%s expression' % (opener,closer))
    return ret

IMHO it fixes a few things:

The original implementation uses ParserElement.DEFAULT_WHITE_CHARS in the default content, apparently out of laziness: it is used only five times outside the ParserElement class itself, four of them in nestedExpr (the fifth is in LineEnd, which manually removes \n). It would be easy enough to add a named argument to nestedExpr instead, although to be fair ParserElement.setDefaultWhitespaceChars achieves the same thing.
The second issue is that, by default, whitespace characters are ignored in the content expression itself, and an additional parse action, lambda t:t[0].strip(), is applied. Because strip is called without arguments, it removes all Unicode whitespace characters. I personally think it makes more sense not to ignore any whitespace within the content, and to strip it selectively in the result instead. For that reason, I removed the tokens with CharsNotIn in the original implementation, and introduced the argument stripchars, which defaults to string.whitespace.
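The difference between the two strip calls only shows up with non-ASCII whitespace. A small sketch, assuming Python 3 string semantics and using a non-breaking space as sample input (the variable names are illustrative):

```python
import string

sample = '\u00a0 body text \n'   # starts with a non-breaking space (U+00A0)

# No argument: every Unicode whitespace character is stripped,
# including the leading non-breaking space.
stripped_all = sample.strip()

# Explicit set: only the ASCII whitespace characters listed in
# string.whitespace are stripped, so the non-breaking space survives.
stripped_ascii = sample.strip(string.whitespace)

print(repr(stripped_all))
print(repr(stripped_ascii))
```

Passing stripchars=None to fencedExpr skips the stripping parse action entirely, which leaves the content as matched.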

Happy to take any constructive criticism on this of course.

Thanks for making the effort to work up some working patch code - I usually get suggestions on changes *I* should make to pyparsing, but only seldom get concrete code patches/implementations. Your interpretation of `nestedExpr` is a little different from mine, I think, and I tried to accommodate different nesting rules by supporting the `content` argument, the default being 0 or more whitespace-delimited words. I may need to remove that auto-strip() parse action though if a `content` expression is given, and let the caller set necessary strip or join or whatever parse actions on the given arg. — PaulMcG, Apr 17 '17 at 01:18
