
Related to : Python parsing bracketed blocks

I have a file with the following format:

#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#

So a block is delimited by an opening # and a closing #. However, the closing # of block n-1 is also the opening # of block n.

I am trying to write a function that, given this format, retrieves the content of each block, and that can also be applied recursively to nested blocks.

I started with regexes but abandoned them quite fast (I think you can guess why), so I tried pyparsing. However, I can't simply write

print(nestedExpr('#','#').parseString(my_string).asList())

because it raises a ValueError (ValueError: opening and closing strings cannot be the same).

Given that I cannot change the input format, do I have any better option than pyparsing for this one?

I also tried using this answer: https://stackoverflow.com/a/1652856/740316, replacing the {/} with #/#, but it fails to parse the string.

Community
  • Maybe just do a replacement on the nth `#` with something like `#-#` then split or parse it by `-`. – l'L'l Apr 08 '15 at 19:15
  • I have no way to change the input format, sadly... –  Apr 08 '15 at 19:26
  • Do you want to separate all levels as strings, or just the outer ones (e.g. all sub-levels of a string would be included in that string)? And what about removing spaces and tabs in the strings? – l'L'l Apr 08 '15 at 20:04
  • Well at first I thought about your latter proposition, then I would repeat the method on the output string recursively, so the first step would give me only the outer levels with other nested blocks as raw strings, then iterating over these would give me the content of these nested blocks, etc. So basically, the answer to your question is your first proposition, but I would like to keep the trace of the level of the block. I don't want all the blocks to be "flattened" in a 1 dimension list, if you see what I mean. –  Apr 08 '15 at 20:19

1 Answer

Unfortunately (for you), your grouping is not dependent only on the separating '#' characters, but also on the indent levels (otherwise, ['with','different','levels'] would be at the same level as the previous group ['and','some','others']). Parsing indent-sensitive grammars is not a strong suit for pyparsing - it can be done, but it is not pleasant. To do so we will use the pyparsing helper macro indentedBlock, which also requires that we define a list variable that indentedBlock can use for its indentation stack.
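To see why the '#' characters alone are not enough, here is a minimal sketch in plain Python (no pyparsing) that splits only on '#' lines and ignores indentation. The names `SAMPLE` and `flat_groups` are made up for this illustration.

```python
SAMPLE = """\
#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#"""


def flat_groups(text):
    """Split text into word groups on '#' lines, ignoring indentation."""
    groups, current = [], []
    for line in text.splitlines():
        token = line.strip()
        if token == "#":
            # every '#' ends the group collected so far
            if current:
                groups.append(current)
            current = []
        elif token:
            current.append(token)
    if current:
        groups.append(current)
    return groups


print(flat_groups(SAMPLE))
# [['here', 'are', 'some', 'strings'], ['and', 'some', 'others'],
#  ['with', 'different', 'levels'], ['of'], ['indentation']]
```

Every group ends up at the same level, so the nesting is lost unless the indentation is taken into account as well.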

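See the embedded comments in the code below for one way to approach this with pyparsing and indentedBlock:

from pyparsing import (ParserElement, LineEnd, Suppress, Optional, Word,
                       alphas, Forward, Group, OneOrMore, ungroup,
                       indentedBlock, delimitedList)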

test = """\
#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#"""

# newlines are significant for line separators, so redefine 
# the default whitespace characters for whitespace skipping
ParserElement.setDefaultWhitespaceChars(' ')

NL = LineEnd().suppress()
HASH = '#'
HASH_SEP = Suppress(HASH + Optional(NL))

# a normal line contains a single word
word_line = Word(alphas) + NL


indent_stack = [1]

# word_block is recursive, since word_blocks can contain word_blocks
word_block = Forward()
word_group = Group(OneOrMore(word_line | ungroup(indentedBlock(word_block, indent_stack))) )

# now define a word_block, as a '#'-delimited list of word_groups, with 
# leading and trailing '#' characters
word_block <<= (HASH_SEP + 
                 delimitedList(word_group, delim=HASH_SEP) + 
                 HASH_SEP)

# the overall expression is one large word_block
parser = word_block

# parse the test string
parser.parseString(test).pprint()

Prints:

[['here', 'are', 'some', 'strings'],
 ['and',
  'some',
  'others',
  [['with', 'different', 'levels'], ['of', [['indentation']]]]]]
PaulMcG