regex split on wrapping patterns

Question

I am honestly stumped. The goal is to split a string on a pair of wrapper characters, but not on those same characters when they appear inside something that is already wrapped.

Take the following string:

s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'

The resulting list should be ['something','{','now I am wrapped {I should not cause splitting} I am still wrapped','}','something else']

The simplest thing I tried was a findall to see how this might work. But the pattern keeps no track of nesting depth, so the match ends as soon as it finds the first closing bracket. Here is what happened:

>>> import re
>>> s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'
>>> re.findall(r'{.*?}',s)
['{now I am wrapped {I should not cause splitting}']

Any ideas on how I could get it to skip a closing bracket that belongs to an inner wrapper?

regex is a simple state machine with no memory as such it does not handle nesting tokens well ... you need to look at something like yacc/lexx (python has `ply` module) see this related question http://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex — Joran Beasley, Sep 13 '13 at 16:19
that suggests to either download a parser (which I can't do because this is for a module and I don't want to require that) or to iterate through each character, which seems unnecessary for something so simple — Ryan Saxe, Sep 13 '13 at 16:37
arbitrary nesting depths is not a simple problem .... you need a parser that has memory (not regex) — Joran Beasley, Sep 13 '13 at 16:38
Don't use only one regex, if you don't want to use lexx or something like that. You can first match the pair by lazy quantifier, like what you've done, and then test whether there is embedded pair in your first match. If there is, add the trailing part to your match. The memory can be done using python rather than regex itself. — Herrington Darkholme, Sep 13 '13 at 16:39
@JoranBeasley: that is supposed to say seeming so simple, because it is clearly now a complicated problem although it really looks as if it shouldn't be this difficult — Ryan Saxe, Sep 13 '13 at 16:42
well its not overly difficult ... but it is one where the parser needs a context awareness (or memory) — Joran Beasley, Sep 13 '13 at 17:24
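
A minimal sketch of the idea from the comments above: let the regex only locate the braces, and keep the nesting memory in plain Python. The helper name split_wrapped and its default arguments are assumptions for illustration, and it assumes single-character wrappers. re.split with a capturing group keeps each brace as its own token, so the loop only has to count depth:

```python
import re


def split_wrapped(text, open_ch='{', close_ch='}'):
    """Split text on top-level wrappers only (hypothetical helper name)."""
    # The capturing group makes re.split keep each wrapper as its own token.
    pattern = '([' + re.escape(open_ch) + re.escape(close_ch) + '])'
    parts = ['']
    depth = 0  # the "memory" that the regex alone does not have
    for token in re.split(pattern, text):
        if token == open_ch:
            depth += 1
            if depth == 1:       # outermost opener: emit it as its own item
                parts += [token, '']
                continue
        elif token == close_ch and depth > 0:
            depth -= 1
            if depth == 0:       # outermost closer: emit it as its own item
                parts += [token, '']
                continue
        parts[-1] += token       # plain text, or a wrapper nested inside another
    return parts


s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'
print(split_wrapped(s))
```

On the string from the question this should give the five-item list asked for. A closing brace with no matching opener is simply kept as text.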

thinker3 · Answer 1 · 2013-09-13T22:12:43.177

1

import re

s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'
m = re.search(r'(.*)({)(.*?{.*?}.*?)(})(.*)', s)
print(m.groups())

new answer:

s = 'something{now I am wrapped {I should {not cause} splitting} I am still wrapped}something else'
m = re.search(r'([^{]*)({)(.*)(})([^}]*)', s)
print(m.groups())

edited Sep 13 '13 at 22:12

answered Sep 13 '13 at 16:32


that only works for my exact example, not some file where I don't know where the brackets are and how it is ordered – Ryan Saxe Sep 13 '13 at 16:35
it actually works pretty good for all test cases I just ran it on... but i doubt it will work with more than 2 levels of nesting – Joran Beasley Sep 13 '13 at 16:36
fails on `s = 'something{now I am wrapped {I should {not} cause splitting} I am still wrapped}something'` but it should work for exactly one level of nesting – Joran Beasley Sep 13 '13 at 16:37
not even for just multiple levels of nesting, but it fails if there are multiple instances. on the string created from `s+=s` it also fails – Ryan Saxe Sep 13 '13 at 16:39
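
The failure on repeated input mentioned in the comments can be reproduced with a short check. This is only a sketch: the pattern is copied from the "new answer" above, and the variable names are mine.

```python
import re

# Pattern copied from the "new answer" above.
pattern = r'([^{]*)({)(.*)(})([^}]*)'

s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'
doubled = s + s  # the s += s case from the comments

groups = re.search(pattern, doubled).groups()

# The greedy (.*) runs up to the LAST closing brace in the string, so the
# text between the two wrapped sections is swallowed into group 3.
print(len(groups))
print(groups[2])
```

The match still has five groups, but the third one spans both wrapped sections and the 'something else' + 'something' text between them, so the second wrapper is never split out.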

score 0 · Answer 2 · answered Sep 13 '13 at 16:25

Not sure if this will always do what you want, but you could use partition and rpartition, like this:

In [26]: s_1 = s.partition('{')
In [27]: s_1
Out[27]: 
('something',
 '{',
 'now I am wrapped {I should not cause splitting} I am still wrapped}something else')
In [30]: s_2 = s_1[-1].rpartition('}')
In [31]: s_2
Out[31]: 
('now I am wrapped {I should not cause splitting} I am still wrapped',
 '}',
 'something else')
In [34]: s_out = s_1[0:-1] + s_2
In [35]: s_out
Out[35]: 
('something',
 '{',
 'now I am wrapped {I should not cause splitting} I am still wrapped',
 '}',
 'something else')

no because I will be using this to organize a file that has many strings like this. the actual 'something' may contain wrapped strings that need to be split as well. Does that make sense? — Ryan Saxe, Sep 13 '13 at 16:31

score 0 · Accepted Answer · answered Sep 13 '13 at 17:28

Based on all the responses, I decided to just write a function that takes the string and the wrappers and outputs the list using brute iteration:

def f(string,wrap1,wrap2):
    wrapped = False
    inner = 0
    count = 0
    holds = ['']
    for c in string:
        if c == wrap1 and not wrapped:
            count += 2
            wrapped = True
            holds.append(wrap1)
            holds.append('')
        elif c == wrap1 and wrapped:
            inner += 1
            holds[count] += c
        elif c == wrap2 and wrapped and inner > 0:
            inner -= 1
            holds[count] += c
        elif c == wrap2 and wrapped and inner == 0:
            wrapped = False
            count += 2
            holds.append(wrap2)
            holds.append('')
        else:
            holds[count] += c
    return holds

and now this shows it working:

>>> s = 'something{now I am wrapped {I should not cause splitting} I am still wrapped}something else'
>>> f(s,'{','}')
['something', '{', 'now I am wrapped {I should not cause splitting} I am still wrapped', '}', 'something else']

I can't for 2 days if it's my own – Ryan Saxe, Sep 14 '13 at 02:05
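
To check the accepted approach against the harder inputs raised in this thread (three levels of nesting, and a doubled string), here is a condensed sketch of the same loop. The name split_top_level is hypothetical, and a single depth counter stands in for the wrapped/inner pair. It is meant to behave the same way, not to replace the original.

```python
def split_top_level(text, wrap1, wrap2):
    """Condensed sketch of the depth-counting loop (hypothetical name)."""
    holds = ['']
    depth = 0  # 0 means we are outside every wrapper
    for c in text:
        if c == wrap1:
            depth += 1
            boundary = depth == 1   # only the outermost opener splits
        elif c == wrap2 and depth > 0:
            depth -= 1
            boundary = depth == 0   # only the outermost closer splits
        else:
            boundary = False        # ordinary text, or a stray closer
        if boundary:
            holds += [c, '']
        else:
            holds[-1] += c
    return holds


# Three levels of nesting, taken from the comments on the regex answer.
nested = 'something{now I am wrapped {I should {not} cause splitting} I am still wrapped}something'
print(split_top_level(nested, '{', '}'))

# Two wrapped sections in a row, in the spirit of the s += s case.
doubled = 'a{b {c} d}e' * 2
print(split_top_level(doubled, '{', '}'))
```

Both inputs split only at the outermost wrappers, because the counter returns to zero exactly when the outermost wrapper closes.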

Birei · Answer 4 · 2013-09-15T15:36:50.743

You can solve this problem using the Scanner class of the re module.

I use the following list of strings as a test:

l = ['something{now I am wrapped {I should not cause splitting} I am still wrapped}everything else',
     'something{now I am wrapped} here {and there} listen',
     'something{now I am wrapped {I should {not} cause splitting} I am still wrapped}everything',
     'something{now {I {am}} wrapped {I should {{{not}}} cause splitting} I am still wrapped}everything']

Create a class that keeps track of the number of currently open curly braces, along with the text collected between the outermost pair. It has three methods: one called when an opening curly brace matches, one for a closing brace, and one for the text between them. Depending on whether the counter (the opened_cb variable) is zero, each method takes a different action:

class Cb():

    def __init__(self, results=None):
        self.results = []
        self.opened_cb = 0

    def s_text_until_cb(self, scanner, token):
        if self.opened_cb == 0:
            return token
        else:
            self.results.append(token)
            return None

    def s_opening_cb(self, scanner, token):
        self.opened_cb += 1
        if self.opened_cb == 1:
            return token
        self.results.append(token)
        return None

    def s_closing_cb(self, scanner, token):
        self.opened_cb -= 1
        if self.opened_cb == 0:
            t = [''.join(self.results), token]
            self.results.clear()
            return t
        else:
            self.results.append(token)
            return None

Finally, I create the Scanner and flatten the results into a plain list:

for s in l:
    results = []
    cb = Cb()
    scanner = re.Scanner([
        (r'[^{}]+', cb.s_text_until_cb),
        (r'[{]', cb.s_opening_cb),
        (r'[}]', cb.s_closing_cb),
    ])
    r = scanner.scan(s)[0]
    for elem in r:
        if isinstance(elem, list):
            results.extend(elem)
        else:
            results.append(elem)
    print('Original string --> {0}\nResult --> {1}\n\n'.format(s, results))

Here is the complete program, followed by a run showing the results:

import re

l = ['something{now I am wrapped {I should not cause splitting} I am still wrapped}everything else',
     'something{now I am wrapped} here {and there} listen',
     'something{now I am wrapped {I should {not} cause splitting} I am still wrapped}everything',
     'something{now {I {am}} wrapped {I should {{{not}}} cause splitting} I am still wrapped}everything']


class Cb():

    def __init__(self, results=None):
        self.results = []
        self.opened_cb = 0

    def s_text_until_cb(self, scanner, token):
        if self.opened_cb == 0:
            return token
        else:
            self.results.append(token)
            return None

    def s_opening_cb(self, scanner, token):
        self.opened_cb += 1
        if self.opened_cb == 1:
            return token
        self.results.append(token)
        return None

    def s_closing_cb(self, scanner, token):
        self.opened_cb -= 1
        if self.opened_cb == 0:
            t = [''.join(self.results), token]
            self.results.clear()
            return t
        else:
            self.results.append(token)
            return None

for s in l:
    results = []
    cb = Cb()
    scanner = re.Scanner([
        (r'[^{}]+', cb.s_text_until_cb),
        (r'[{]', cb.s_opening_cb),
        (r'[}]', cb.s_closing_cb),
    ])
    r = scanner.scan(s)[0]
    for elem in r:
        if isinstance(elem, list):
            results.extend(elem)
        else:
            results.append(elem)
    print('Original string --> {0}\nResult --> {1}\n\n'.format(s, results))

Run it like:

python3 script.py

That yields:

Original string --> something{now I am wrapped {I should not cause splitting} I am still wrapped}everything else
Result --> ['something', '{', 'now I am wrapped {I should not cause splitting} I am still wrapped', '}', 'everything else']


Original string --> something{now I am wrapped} here {and there} listen
Result --> ['something', '{', 'now I am wrapped', '}', ' here ', '{', 'and there', '}', ' listen']


Original string --> something{now I am wrapped {I should {not} cause splitting} I am still wrapped}everything
Result --> ['something', '{', 'now I am wrapped {I should {not} cause splitting} I am still wrapped', '}', 'everything']


Original string --> something{now {I {am}} wrapped {I should {{{not}}} cause splitting} I am still wrapped}everything
Result --> ['something', '{', 'now {I {am}} wrapped {I should {{{not}}} cause splitting} I am still wrapped', '}', 'everything']

@RyanSaxe: I don't know. It seems to work with all the examples that I saw in the question and some of the answers. — Birei, Sep 15 '13 at 21:14
