Python regex split into characters except if followed by parentheses

Question

I have a string like "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]", where each character except for parentheses and what they enclose represents a kind of instruction. A character can be followed by an optional list of arguments specified in optional parentheses.

I would like to split such a string into ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']. At the moment, however, I only get ['F(230,24)', 'F', '[', 'f(22)_(23);(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'] (the substring 'f(22)_(23);(2)' is not split correctly).

Currently I am using list(filter(None, re.split(r'([A-Za-z\[\]\+\-\^\&\\\/%_;~](?!\())', string))), which is just a character class listing the instruction characters, followed by a negative lookahead for (. list(filter(None, <list>)) is used to remove empty strings from the result.

I am aware that this is likely caused by Python's re.split having been designed not to split on a zero-length match, as discussed here. However, I was wondering what a good solution would be. Is there a better way than re.findall?

Thank you.

EDIT: Unfortunately I am not allowed to use third-party packages such as the regex module.

What is your regex? It also might be helpful to include why you are doing this, as there may be an easier way. – Artyer, Jul 27 '16 at 18:31
You could just do `re.findall("([^()](\\([^)]+\\))?)", s)` (replacing the `[^()]` with all the characters you want) – Artyer, Jul 27 '16 at 18:35
For now yes. But in fact they might be strings as well in the future, so I guess I am a lot happier with the `findall` approach. – Nikole, Jul 27 '16 at 18:52

score 2 · Answer 1 · edited May 23 '17 at 12:18

I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match

You can use the VERSION1 flag of the third-party regex module. Taking the example from the thread you linked, split() then splits on zero-width matches as well:

>>> import regex as re
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!", flags=re.V1)
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
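
If third-party packages are not an option, a similar zero-width split can be sketched with the standard library alone, assuming Python 3.7 or later, where re.split accepts patterns that can match an empty string. The name `boundary` and the reuse of the question's character class are illustrative choices, not part of the original answer:

```python
import re

s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"

# Zero-width lookahead: matches the empty position in front of every
# instruction character (same character class as in the question).
boundary = re.compile(r"(?=[A-Za-z\[\]\+\-\^\&\\\/%_;~])")

# Splitting at position 0 yields a leading empty string, so drop empties.
tokens = [t for t in boundary.split(s) if t]
print(tokens)
```

This relies on the argument lists containing only characters outside the instruction class (digits and commas here); a letter inside the parentheses would be treated as a new split point.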

answered Jul 27 '16 at 18:34

alecxe

Unfortunately I can't use third-party packages. – Nikole Jul 27 '16 at 18:42

Psidom · Accepted Answer · 2016-07-27T18:46:11.847

You can use re.findall to find every single character that is optionally followed by a pair of parentheses:

import re
s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
re.findall(r"[^()](?:\([^()]*\))?", s)  # raw string, so \( reaches the regex engine unchanged

['F(230,24)',
 'F',
 '[',
 'f(22)',
 '_(23)',
 ';(2)',
 '%',
 '[',
 '+(45)',
 'F',
 'F',
 ']',
 ']']

[^()] matches a single character other than a parenthesis;
(?:\([^()]*\))? is a non-capturing group (?:...) that matches a pair of parentheses and whatever they enclose; the trailing ? makes the whole group optional.
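
The same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a restatement of the pattern above for readability; the name `TOKEN` is an illustrative choice:

```python
import re

TOKEN = re.compile(r"""
    [^()]           # one instruction character: anything except a parenthesis
    (?:             # non-capturing group, so findall returns whole matches
        \(          # literal opening parenthesis
        [^()]*      # argument list: any run of non-parenthesis characters
        \)          # literal closing parenthesis
    )?              # the whole parenthesised part is optional
""", re.VERBOSE)

s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
tokens = TOKEN.findall(s)
print(tokens)
```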

edited Jul 27 '16 at 18:46

answered Jul 27 '16 at 18:36

Psidom

This solves the problem much more elegantly than split does. Thank you! :) – Nikole Jul 27 '16 at 18:43
Just for clarification: The non-capture group is used to assign the ? to make that group optional, but not have it as a capture group such that when there are no explicit capture groups the whole match is returned? – Nikole Jul 28 '16 at 22:18
Exactly. It is necessary because `?` only makes the character right before it optional, and if we want to make a pattern optional we have to group it. – Psidom Jul 28 '16 at 22:25
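
A small comparison may make the point about the non-capturing group concrete. With a capturing group, re.findall returns the group's contents instead of the whole match; the shortened input string and the variable names are chosen here only for illustration:

```python
import re

s = "F(230,24)F[f(22)"

# Capturing group: findall returns only what the group matched
# (an empty string where the optional group did not take part).
with_capture = re.findall(r"[^()](\([^()]*\))?", s)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r"[^()](?:\([^()]*\))?", s)

print(with_capture)     # ['(230,24)', '', '', '(22)']
print(without_capture)  # ['F(230,24)', 'F', '[', 'f(22)']
```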

score 1 · Answer 3 · answered Jul 27 '16 at 19:49

Another solution. This time the pattern recognizes strings with the structure SYMBOL[(NUMBER[,NUMBER...])]. The function parse_it returns True and the list of tokens if the whole string matches the regular expression, and False and an empty string if it does not.

import re
def parse_it(string):
    '''
    Input: String to parse
    Output: True|False, Tokens|empty_string
    '''
    # Raw string: the backslashes are passed to the regex engine unchanged.
    pattern = re.compile(r'[A-Za-z\[\]\+\-\^\&\\\/%_;~](?:\(\d+(?:,\d+)*\))?')
    tokens = pattern.findall(string)
    if ''.join(tokens) == string:
        res = (True, tokens)
    else:
        res = (False, '')
    return res

good_string = 'F(230,24)F[f(22)_(23);(2)%[+(45)FF]]'
bad_string = 'F(2a30,24)F[f(22)_(23);(2)%[+(45)FF]]' # There is an 'a' in a bad place.

print(parse_it(good_string))
print(parse_it(bad_string))

Output:

(True, ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'])
(False, '')
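
As a variation on the same idea, the tokens can be consumed one at a time with pattern.match(string, pos), which makes it possible to report where the input stops being valid. This is a sketch rather than part of the answer above; the function name `tokenize` and the choice to raise ValueError are assumptions made for illustration:

```python
import re

# Same token pattern as in parse_it, written as a raw string.
TOKEN = re.compile(r'[A-Za-z\[\]\+\-\^\&\\\/%_;~](?:\(\d+(?:,\d+)*\))?')

def tokenize(text):
    """Return the list of tokens, or raise ValueError at the first bad index."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)  # anchored at pos, never skips characters
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at index {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

good_string = 'F(230,24)F[f(22)_(23);(2)%[+(45)FF]]'
bad_string = 'F(2a30,24)F[f(22)_(23);(2)%[+(45)FF]]'

print(tokenize(good_string))
try:
    tokenize(bad_string)
except ValueError as err:
    print(err)
```

For the bad string the error points at the opening parenthesis: the malformed argument list cannot be matched as part of the F token, so the parenthesis is left standing on its own.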
