1

I'm working on a simple wiki engine, and I'm wondering whether there is an efficient way to split a string into a list on a separator, but only where that separator is not enclosed in double square brackets or double curly brackets.

So, a string like this:

"|Row 1|[[link|text]]|{{img|altText}}|"

Would get converted to a list like this:

['Row 1', '[[link|text]]', '{{img|altText}}']

EDIT: Removed the spaces from the example string, since they were causing confusion.

BenMorel
Zauberin Stardreamer
  • In your example, you use the separator `'|'` within the double curly/square brackets and at the beginning/end of the string, and the separator `' | '` otherwise. Are the spaces part of the separator or can you not assume anything about them? – mdml Oct 05 '13 at 21:14
  • There are tons of MediaWiki parsers out there for Python: http://www.mediawiki.org/wiki/Alternative_parsers – Blender Oct 05 '13 at 21:14
  • @mtitan8: The spaces were only added for readability, and I've removed them. – Zauberin Stardreamer Oct 05 '13 at 21:21
  • And I'm not writing a MediaWiki parser, I'm writing a customized parser for Creole because CreoleParser does not handle utf-8 gracefully, due to its dependency on Genshi. – Zauberin Stardreamer Oct 05 '13 at 21:23
  • Tried to figure out how to adapt the regexp in http://stackoverflow.com/questions/4780728/regex-split-string-preserving-quotes to the quoting here but failing to `re.compile()` with `re.VERBOSE`. – Erik Kaplun Oct 05 '13 at 21:36
  • Since there is an empty string before the first `|` and after the last `|`, the result would be `['', 'Row 1', '[[link|text]]', '{{img|altText}}', '']`, wouldn't it? – Tim Pietzcker Oct 05 '13 at 22:01
  • Aye, though that's taken care of prior to parsing. – Zauberin Stardreamer Oct 05 '13 at 22:29

3 Answers

3

You can use a regex split with two negative lookaheads:

import re

def split_special(subject):
    return re.split(r"""
        \|           # Match |
        (?!          # only if it's not possible to match...
         (?:         # the following non-capturing group:
          (?!\[\[)   # that doesn't contain two square brackets
          .          # but may otherwise contain any character
         )*          # any number of times,
         \]\]        # followed by ]]
        )            # End of first lookahead. Now the same thing for braces:
        (?!(?:(?!\{\{).)*\}\})""",
        subject, flags=re.VERBOSE)

Result:

>>> s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"
>>> split_special(s)
['', 'Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}', '']

Note the leading and trailing empty strings: they are there because the test string has nothing before its first | and nothing after its last |.
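If those empty strings are unwanted, one option is to drop the outer separators before splitting. This is only a minimal sketch, assuming every row starts and ends with a single `|`; `split_cells` and `SEPARATOR` are hypothetical names, and the pattern is the same one as above written on one line without `re.VERBOSE`.

```python
import re

# Same pattern as split_special above, compressed onto one line.
SEPARATOR = re.compile(r"\|(?!(?:(?!\[\[).)*\]\])(?!(?:(?!\{\{).)*\}\})")

def split_cells(row):
    # Hypothetical helper. str.strip("|") removes every leading and trailing
    # "|", so this assumes the row has exactly one at each end.
    return SEPARATOR.split(row.strip("|"))

print(split_cells("|Row 1|[[link|text]]|{{img|altText}}|"))
# ['Row 1', '[[link|text]]', '{{img|altText}}']
```

Stripping first keeps the regex unchanged; a row that legitimately begins or ends with an empty cell would lose it under this assumption.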

Tim Pietzcker
1

Tim's expression is elaborate, but you can usually greatly simplify "split" expressions by converting them to "match" ones:

import re
s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"

print(re.findall(r'\[\[.+?\]\]|{{.+?}}|[^|]+', s))

# ['Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}']
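One difference to keep in mind when choosing between matching and splitting: because `[^|]+` needs at least one character, the match-based version skips empty cells, whereas a split-based approach returns an empty string for each of them. A small sketch with a made-up row that has an empty middle cell; `find_cells` and `CELL` are hypothetical names, and the braces are escaped here only for readability.

```python
import re

# Same alternation as above: a [[...]] block, a {{...}} block, or a run of non-| characters.
CELL = re.compile(r"\[\[.+?\]\]|\{\{.+?\}\}|[^|]+")

def find_cells(row):
    # Hypothetical wrapper around findall; returns only non-empty cells.
    return CELL.findall(row)

row = "|a||[[link|text]]|"
print(find_cells(row))
# ['a', '[[link|text]]']
```

Whether dropping empty cells is acceptable depends on how the table rows are meant to line up.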
georg
-2

Is it possible to have Row 1|[? If the separator is always surrounded by spaces, as in your example above, you can do

s.split(" | ")
Tommy
  • @Tommy: how about `"Bla|Bla|Bla"`? – Erik Kaplun Oct 05 '13 at 21:20
  • @ErikAllik notice that in my answer I *asked* if it was possible to have | with no surrounding spaces. Moreover, my answer said "if" it's always surrounded by spaces, so I did not attempt to handle your "Bla" case. – Tommy Oct 05 '13 at 22:44