I have a string like "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]", where each character except for parentheses and what they enclose represents a kind of instruction. A character can be followed by an optional list of arguments specified in optional parentheses.

I would like to split such a string into ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']; however, at the moment I only get ['F(230,24)', 'F', '[', 'f(22)_(23);(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'] (the substring 'f(22)_(23);(2)' is not split correctly).

Currently I am using list(filter(None, re.split(r'([A-Za-z\[\]\+\-\^\&\\\/%_;~](?!\())', string))), which is just a character class of the instruction symbols followed by a negative lookahead for (. The list(filter(None, <list>)) wrapper removes empty strings from the result.

I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match, as discussed here. However, I was wondering what a good solution would be. Is there a better way than re.findall?

Thank you.

EDIT: Unfortunately, I am not allowed to use custom packages like the regex module.

Nikole

3 Answers


I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match

You can use the VERSION1 flag of the regex module. Taking the example from the thread you linked, see how split() then splits on zero-width matches as well:

>>> import regex as re
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!", flags=re.V1)
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
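For readers who, like the asker, are limited to the standard library: since Python 3.7 the built-in re.split also splits on empty matches, so a zero-width lookahead can be used directly. The sketch below is only one possible way to apply that here. It assumes that argument lists contain nothing but digits and commas, so that every other character outside parentheses starts a new instruction; the pattern and the name parts are illustrative and not taken from the answers.

```python
import re

s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"

# Zero-width lookahead: split *before* every character that is not a digit,
# a comma, or a parenthesis (assumption: such a character is always an
# instruction symbol). Requires Python 3.7+, where re.split accepts
# patterns that match the empty string.
pieces = re.split(r"(?=[^\d(),])", s)

# The empty match at position 0 yields a leading '', so drop empty strings.
parts = [p for p in pieces if p]

print(parts)
```

On the example string this yields the same thirteen tokens the question asks for. If arguments may contain other characters (letters, signs, decimal points), the lookahead would have to be adjusted, and a findall-based approach is probably easier to maintain.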
alecxe

You can use re.findall to find every single character, each optionally followed by a pair of parentheses:

import re
s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
re.findall(r"[^()](?:\([^()]*\))?", s)

['F(230,24)',
 'F',
 '[',
 'f(22)',
 '_(23)',
 ';(2)',
 '%',
 '[',
 '+(45)',
 'F',
 'F',
 ']',
 ']']
  • [^()] matches a single character other than a parenthesis;
  • (?:\([^()]*\))? is a non-capturing group, (?:...), that matches a pair of parentheses and whatever they enclose; the trailing ? makes the whole group optional.
Psidom
  • This solves the problem much more elegantly than via split. Thank you! :) – Nikole Jul 27 '16 at 18:43
  • Just for clarification: The non-capture group is used to assign the ? to make that group optional, but not have it as a capture group such that when there are no explicit capture groups the whole match is returned? – Nikole Jul 28 '16 at 22:18
  • Exactly. It is necessary because `?` only makes the element right before it optional, so to make a whole sub-pattern optional we have to group it. – Psidom Jul 28 '16 at 22:25
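The point raised in these comments can be checked directly: re.findall returns whole matches only when the pattern has no capturing groups, and returns the group contents otherwise. A small sketch on a shortened input (the variable names whole and groups are just for illustration):

```python
import re

s = "F(230,24)F[f(22)"

# Non-capturing group: findall returns the full text of each match.
whole = re.findall(r"[^()](?:\([^()]*\))?", s)

# Same pattern with a capturing group: findall now returns only the group,
# and an empty string wherever the optional group did not take part.
groups = re.findall(r"[^()](\([^()]*\))?", s)

print(whole)   # ['F(230,24)', 'F', '[', 'f(22)']
print(groups)  # ['(230,24)', '', '', '(22)']
```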

Another solution. This time the pattern recognizes strings with the structure SYMBOL[(NUMBER[,NUMBER...])]. The function parse_it returns True and the tokens if the whole string matches the regular expression, and False and an empty string if it doesn't.

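import re

def parse_it(string):
    '''
    Input: String to parse
    Output: (True, tokens) if the whole string is valid, else (False, '')
    '''
    # One instruction symbol, optionally followed by (NUMBER[,NUMBER...]).
    # Raw string, so the backslashes reach the regex engine unchanged.
    pattern = re.compile(r'[A-Za-z\[\]\+\-\^\&\\\/%_;~](?:\(\d+(?:,\d+)*\))?')
    tokens = pattern.findall(string)
    # If the tokens do not rebuild the input, some characters were skipped.
    if ''.join(tokens) == string:
        res = (True, tokens)
    else:
        res = (False, '')
    return res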

good_string = 'F(230,24)F[f(22)_(23);(2)%[+(45)FF]]'
bad_string = 'F(2a30,24)F[f(22)_(23);(2)%[+(45)FF]]' # There is an 'a' in a bad place.

print(parse_it(good_string))
print(parse_it(bad_string))

Output:

(True, ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'])
(False, '')
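If the position of the first invalid character is also of interest, the same token pattern can be applied step by step with Pattern.match instead of findall. The following is a minimal sketch along those lines, not part of the answer above: the function name parse_instructions, the (symbol, arguments) tuple layout, and the choice to raise ValueError are all assumptions made for illustration. The symbol class is the one from the question.

```python
import re

# Group 1: one instruction symbol (character class taken from the question).
# Group 2: the optional comma-separated integer arguments, without parentheses.
TOKEN = re.compile(r"([A-Za-z\[\]+\-^&\\/%_;~])(?:\((\d+(?:,\d+)*)\))?")

def parse_instructions(text):
    """Return a list of (symbol, args) pairs; raise ValueError on bad input."""
    result = []
    pos = 0
    while pos < len(text):
        # Anchor the match at the current position so nothing is skipped.
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at index {pos}")
        symbol, args = m.group(1), m.group(2)
        numbers = tuple(int(a) for a in args.split(",")) if args else ()
        result.append((symbol, numbers))
        pos = m.end()
    return result

good = parse_instructions("F(230,24)F[+(45)]")
print(good)

try:
    parse_instructions("F(2a30,24)F")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

For the bad string, the F is accepted on its own (its argument list does not match), and the error is then reported at the opening parenthesis at index 1, which is where the malformed argument list starts.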

Jose Raul Barreras