1

I have a string s1 = 'type1/type2/type3', which I can simply split with s1.split('/') to get ['type1', 'type2', 'type3'].

But there are also strings like s2 = 'type1/type2/type3(a/c)'. Using the method above gives ['type1', 'type2', 'type3(a', 'c'], which is not what I want; I would prefer ['type1', 'type2', 'type3(a/c)'].

I want to know how to split both kinds of string using a regex. Please help me solve this problem.

anubhava
JinLing

3 Answers

1

You can use a negative lookahead based regex for split:

>>> import re
>>> s = 'type1/type2/type3(a/c)'
>>> print(re.split(r'/(?![^()]*\))', s))
['type1', 'type2', 'type3(a/c)']


This assumes that the parentheses ( ... ) are balanced and not nested, and that there are no escaped parentheses.

(?![^()]*\)) is a negative lookahead that fails the match when a ) lies ahead with no ( or ) in between. A / found inside (...) therefore fails to match and is not used as a split point.
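To make the assumption above concrete, here is a small sketch (Python 3, stdlib re only) that runs the same pattern on one flat input and one nested input. The constant name SPLIT_OUTSIDE_PARENS is just an illustrative choice, and the nested string is the one raised in the comments. On the nested input the pattern splits inside the outer parens, which is exactly the limitation described.

```python
import re

# Split on '/' unless a ')' follows with no '(' or ')' in between.
SPLIT_OUTSIDE_PARENS = re.compile(r'/(?![^()]*\))')

# Un-nested parens: the '/' inside (a/c) is protected.
flat = SPLIT_OUTSIDE_PARENS.split('type1/type2/type3(a/c)')

# Nested parens: the '/' after 'B' sees '(' before any ')', so it is split.
nested = SPLIT_OUTSIDE_PARENS.split('A/(B/C(D/E))/F')

print(flat)    # ['type1', 'type2', 'type3(a/c)']
print(nested)  # ['A', '(B', 'C(D/E))', 'F']
```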

Community
anubhava
  • I think it is not handling nesting correctly. `'A/(B/C(D/E))/F'` is balanced. – wim Sep 26 '17 at 04:52
  • Python Regex cannot handle nested parentheses and that's why I wrote my assumption in the answer. PCRE has this feature though. – anubhava Sep 26 '17 at 04:54
  • Just fyi [Python: How to match nested parentheses with regex?](https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex) to corroborate what I am saying. – anubhava Sep 26 '17 at 05:09
  • As I wrote about PCRE, you are also not using Python's builtin `re` module, you are using an external library `regex` with PCRE like features. btw I am not the one who down voted your answer. – anubhava Sep 26 '17 at 05:41
  • btw even PCRE without recursion based regex won't work. [You can see your solution fails with this input `A/(B/C(D/E))/F(a/c)`](https://regex101.com/r/N45go8/2) – anubhava Sep 26 '17 at 05:45
  • Good catch. I'll have to improve the rejection side of it. Anyway I think your answer is quite fine :) It just needs to mention the nesting limitation, because the only assumption mentioned is balanced parens and unescaped. – wim Sep 26 '17 at 05:47
  • Fixed mine for `A/(B/C(D/E))/F(a/c)`. Let me know if you see any other pathological cases... – wim Sep 26 '17 at 05:59
  • Unfortunately that won't work for many inputs, e.g. `A/((a/c)D/E)` – anubhava Sep 26 '17 at 07:30
0

There is a technique available which uses features from a more powerful regex implementation, the third-party regex module. Don't worry, it's backwards-compatible with the standard re module. The basic idea is also possible in standard re, but it's a bit more fiddly; I will outline the method for the stdlib module at the end of this answer.

# pip install regex
import regex as re
s1 = 'type1/type2/type3'
s2 = 'type1/type2/type3(a/c)'
s3 = 'A/(B/C(D/E))/F'
s4 = 'A/(B/C(D/E))/F(a/c)'

Here's the pattern:

pat = r'\(.*?\)(*SKIP)(*FAIL)|/'

Demo:

>>> re.split(pat, s1)
['type1', 'type2', 'type3']
>>> re.split(pat, s2)
['type1', 'type2', 'type3(a/c)']
>>> re.split(pat, s3)
['A', '(B/C(D/E))', 'F']
>>> re.split(pat, s4)
['A', '(B/C(D/E))', 'F(a/c)']

How does it work? Read the regex like this:

blacklisted(*SKIP)(*FAIL)|matched

This pattern first discards anything enclosed in parens, matched non-greedily by \(.*?\); that's where we use the (*SKIP)(*FAIL) feature, which is not available in stdlib re yet. Then it matches what's left on the right-hand side of the |, i.e. a slash.

As I mentioned, the technique is also possible in standard re, but you have to use capture groups. The pattern will need a capture group surrounding the slash on the right side:

pat_ = r'\(.*?\)|(/)'

Group 1 will be set only for the "good" matches, so you can iterate like this (here over s2):

>>> for match in re.finditer(pat_, s2):
...     if match[1] is not None:
...         print(match.start())
...
5
11

This prints the indices at which you need to split. It's trivial then to split the string programmatically. You can actually do it in regex directly using re.sub and re.split, but it's cleaner and easier just to do the split in Python code once you have the indices.
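For instance, the split from those indices could look roughly like this. It's only a sketch using the stdlib re module: the helper name split_at_good_slashes is made up for illustration, and it inherits the pattern's limitation with nested parens such as 'A/((a/c)D/E)'.

```python
import re

# Left alternative swallows a parenthesised chunk; group 1 captures a "good" slash.
PAT = re.compile(r'\(.*?\)|(/)')

def split_at_good_slashes(s):
    """Split s at every slash that was captured in group 1 (hypothetical helper)."""
    parts, start = [], 0
    for match in PAT.finditer(s):
        if match.group(1) is not None:
            # Cut just before the slash and resume just after it.
            parts.append(s[start:match.start()])
            start = match.end()
    parts.append(s[start:])  # whatever follows the last good slash
    return parts

print(split_at_good_slashes('type1/type2/type3'))       # ['type1', 'type2', 'type3']
print(split_at_good_slashes('type1/type2/type3(a/c)'))  # ['type1', 'type2', 'type3(a/c)']
```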

wim
  • Honestly, I haven't read any of the downvoted answers but will delete my answer now :) – Jan Sep 26 '17 at 07:53
0

Non-regex alternative

I know you tagged this regex, but these kinds of problems are not well-suited to regular expressions. There are many tricky edge cases, and the failure mode for edge cases is often returning incorrect results, when you would prefer an exception raised instead.

You have to choose the lesser of two evils: a simple regex which misbehaves on weird inputs, or a monster regex which is incomprehensible to everyone except the regex engine itself.

It's often easier just to write a little parser that keeps track of whether you're enclosed in parens or not. That's simple to write, and simple to maintain.

Here's a parser-based solution and a barrage of tests that might trip up any regex-based approach. This will also detect when the problem is poorly constrained (unbalanced parens), and raise if necessary.

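def fancy_split(s):
    """Split s on '/' characters that are not enclosed in parentheses."""
    parts, accumulator, nesting_level = [], [], 0
    for char in s:
        if char == '/' and nesting_level == 0:
            # Top-level slash: finish the current part and start a new one.
            parts.append(''.join(accumulator))
            accumulator = []
            continue
        accumulator.append(char)
        if char == '(':
            nesting_level += 1
        elif char == ')':
            nesting_level -= 1
            if nesting_level < 0:
                # A ')' appeared before its matching '('.
                raise ValueError('unbalanced parens')
    parts.append(''.join(accumulator))
    if nesting_level != 0:
        # At least one '(' was never closed.
        raise ValueError('unbalanced parens')
    # Sanity check: rejoining the parts must reproduce the input.
    assert '/'.join(parts) == s
    return parts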

tests = {
    'type1/type2/type3': ['type1', 'type2', 'type3'],
    'type1/type2/type3(a/c)': ['type1', 'type2', 'type3(a/c)'],
    'A/(B/C(D/E))/F': ['A', '(B/C(D/E))', 'F'],
    'A/(B/C(D/E))/F(a/c)': ['A', '(B/C(D/E))', 'F(a/c)'],
    'A/((a/c)D/E)': ['A', '((a/c)D/E)'],
    'A': ['A'],
    'A/': ['A', ''],
    '/A': ['', 'A'],
    '': [''],
    '/': ['', ''],
    '//': ['', '', '']
}

for input_, expected_output in tests.items():
    assert fancy_split(input_) == expected_output
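The tests above only cover balanced input. As a rough sketch of the "raise if necessary" side, here is a condensed, self-contained restatement of the same depth-counting idea run on a few unbalanced strings. The name split_top_level, the sep parameter, the error messages and the choice of ValueError are assumptions made for this illustration, not part of the function above.

```python
def split_top_level(s, sep='/'):
    """Split s on sep, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for char in s:
        if char == sep and depth == 0:
            parts.append(''.join(current))
            current = []
            continue
        if char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
            if depth < 0:
                raise ValueError('unbalanced parens: unexpected ")"')
        current.append(char)
    if depth != 0:
        raise ValueError('unbalanced parens: missing ")"')
    parts.append(''.join(current))
    return parts

# Each of these inputs is unbalanced, so each call should raise.
for bad in ['type3(a/c', 'a/c)', ')(']:
    try:
        split_top_level(bad)
    except ValueError as exc:
        print(f'{bad!r}: {exc}')
```

A regex-based split would quietly return some list for these inputs, whereas the parser refuses them.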
wim