I want to split a string into a list of words (here "word" means arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespaces that have been used as separators (because the number of whitespaces is significant in my data). For this simple task, I know that the following regex would do the job (I use Python as an illustrative language, but the code can be easily adapted to any language including regexes):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with %
should be tested for the "parenthesis rule" (regexB
), the other words should be treated by regexA
. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...