I want to split a string into a list of words (here a "word" is any sequence of non-whitespace characters), but also keep the runs of consecutive whitespace that were used as separators (because the number of whitespace characters is significant in my data). For this simple task, I know that the following regex does the job (I use Python for illustration, but the code can easily be adapted to any language that supports regexes):

import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b   cc dd!    :ee  "))

produces the expected output:

['', 'aa', ' ', 'b+b', '   ', 'cc', ' ', 'dd!', '    ', ':ee', '  ']

Now the hard part: when a word includes an opening parenthesis, any whitespace encountered up to the matching closing parenthesis should not be treated as a word separator. In other words:

regexB.split("aa b+b   cc(dd! :ee (ff gg) hh) ii  ")

should produce:

['', 'aa', ' ', 'b+b', '   ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', '  ']

Using

regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')

works for a single pair of parentheses, but fails when there are inner parentheses. How can I improve the regex so that it correctly skips inner parentheses?
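To make the failure concrete, here is a minimal check using the same regexB on the nested input. The variable names `nested` and `parts` are only for illustration.

```python
import re

# Same pattern as regexB above: a word containing one "(...)" group, else \S+.
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')

nested = "aa b+b   cc(dd! :ee (ff gg) hh) ii  "
parts = regexB.split(nested)
print(parts)
# [^)]* stops at the first ")", so the word is cut right after "(ff gg)":
# ['', 'aa', ' ', 'b+b', '   ', 'cc(dd! :ee (ff gg)', ' ', 'hh)', ' ', 'ii', '  ']
```

The outer closing parenthesis ends up in a separate word, `'hh)'`, which is exactly the behaviour I want to avoid.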

And the final question: in my data, only words starting with % should be subject to the "parenthesis rule" (regexB); all other words should be handled by regexA. I have no idea how to combine two regexes in a single split.

Any hint is warmly welcome...

sciroccorics
  • Regex can't match nested structures like parentheses. You'll have to write some code. – Aran-Fey Apr 10 '18 at 23:54
  • @Aran: I agree for the general case, but in my case, I know that there is at most one inner pair of parentheses. Does this constraint change the problem? – sciroccorics Apr 10 '18 at 23:56
  • As Aran-Fey said, regex can't understand nesting. For example, for a string like `(a b (c d) e f)`, if your regex is non-greedy, then it will match `(a b (c d)` and `(c d)`. If it's greedy, it will match `(a b (c d) e f)` and `(c d) e f)`. Both of those are problematic, in different ways. – Niayesh Isky Apr 11 '18 at 00:45
  • However, Python does have parsing libraries that you might want to look into, as explained in this answer: [Matching Nested Structures With Regular Expressions in Python](https://stackoverflow.com/a/1101046/7315159) – Niayesh Isky Apr 11 '18 at 00:47
  • [Here is a hint](http://rextester.com/RJV53548). – Wiktor Stribiżew Apr 11 '18 at 06:40
  • @Wiktor: Using `finditer` instead of `split` is a very clever approach to solve the problem. Thanks a lot! I'm going to write a summary here, as I am not sure that the link on **rextester** lasts very long – sciroccorics Apr 11 '18 at 09:27
  • I can post the answer if you wish to accept. Note that PyPi regex module is used there. Not the regular `re`. – Wiktor Stribiżew Apr 11 '18 at 09:30
  • @Wiktor: Yes, after reading the answer by @Thm, I understood that your solution also requires subroutines which are not supported by the standard `re`. So I mixed your respective answers to get a rather clean solution working in standard `re`... – sciroccorics Apr 11 '18 at 10:33

2 Answers


The PCRE regex engine supports subroutine calls, so a recursive pattern can handle balanced, nested parentheses:

(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)

Here (?1) calls subroutine 1, i.e. the group (\([^()]*(?1)?[^()]*\)). The pattern is recursive because that group contains the call to itself.

However, Python's standard re module does not support subroutine calls.

So in my Python script I first replace every ( and ) with another distinctive character (@ in this example), then split with the regex, and finally turn each @ back into ( or ) as appropriate.

Regex for splitting:

(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)

Here I split on runs of whitespace (\s+) instead of your \S+, because @, ( and ) all belong to the set of characters matched by \S.

The Python script could look like this:

import re

ss = "aa b+b   cc(dd! :ee ((ff gg)) hh) ii  "
ss = re.sub(r"[()]", "@", ss)   # replace every `(` and `)` with `@`

regx = re.compile(r"(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)")
m = regx.split(ss)
for i, piece in enumerate(m):   # turn `@` back into `(` or `)`
    n = piece.count("@")
    if n < 2:
        continue
    # the first half of the placeholders become `(`, the rest become `)`
    m[i] = piece.replace("@", "(", n // 2).replace("@", ")")
print(m)

Output:

['aa', ' ', 'b+b', '   ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', '  ', '']
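
The lookahead only tests whether an even number of @ placeholders remains before the end of the line. The sketch below isolates that test; `split_placeholders` is a hypothetical helper, not part of the script above. For balanced input, the parity of the remaining placeholders equals the parity of the nesting depth, so whitespace is kept at odd depths (1, 3, ...) and still split at even depths such as 2. The double parentheses in the sample string above keep all inner whitespace at an odd depth.

```python
import re

# Same splitting regex as in the script above.
regx = re.compile(r"(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)")

def split_placeholders(s):
    # Hypothetical helper: swap both parentheses for "@", then apply the parity split.
    return regx.split(re.sub(r"[()]", "@", s))

flat = split_placeholders("a (b c) d")
print(flat)            # ['a', ' ', '@b c@', ' ', 'd']

single_inner = split_placeholders("a (b (c d) e) f")
print(single_inner)    # ['a', ' ', '@b @c', ' ', 'd@ e@', ' ', 'f']
```

In the second call the space between `c` and `d` sits at depth 2, so it is treated as a separator. For input with a single inner pair of parentheses, a depth counter is the safer route.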
Thm Lee
  • It took me some time, after reading your answer, to understand the power of subroutines (too bad that the standard `re` module does not support them). But I like your substitution approach that works with the standard `re` module. I just twisted it a little bit by including @Wiktor's `finditer` idea to get shorter and cleaner code. – sciroccorics Apr 11 '18 at 10:25
  • Thanks for your info. Wiktor's method is great:-) – Thm Lee Apr 11 '18 at 15:48

Finally, after testing several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I came up with a set of solutions that handle increasing levels of complexity. To limit dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:

import re

text = "aa b%b(   %cc(dd! (:ee ff) gg) %hh ii)  "

# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))

# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))

# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))

# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print(words)

Here is the generated output:

Solution 1: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 2: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 3: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 4: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']

As stated in the OP, for my specific data, whitespace inside parentheses only has to be protected for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to protect whitespace inside any pair of parentheses, simply remove the % character from the regexes (and the word[0] == '%' test from Solution 4). Here is the result with that modification:

Solution 1: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 2: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 3: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 4: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff) gg) %hh ii)', '  ']
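
For reuse, Solution 4 can be wrapped in a function. This is only a sketch: the name `split_keep_spaces` and the `prefix` parameter are illustrative, and clamping the depth at zero is an extra assumption (not in the code above) so that a stray `)` cannot glue the rest of the string together.

```python
import re

def split_keep_spaces(text, prefix='%'):
    """Split into words and whitespace runs, keeping parenthesised groups whole.

    Only words starting with `prefix` open a group; pass prefix='' to apply
    the parenthesis rule to every word.
    """
    depth, words = 0, []
    for word in re.split(r'(\S+)', text):
        if depth:
            words[-1] += word          # still inside parentheses: extend the last item
        else:
            words.append(word)
        if depth or word.startswith(prefix):
            # Assumption: clamp at 0 so an unmatched ")" does not go negative.
            depth = max(depth + word.count('(') - word.count(')'), 0)
    return words

text = "aa b%b(   %cc(dd! (:ee ff) gg) %hh ii)  "
with_prefix = split_keep_spaces(text)
any_parens = split_keep_spaces(text, prefix='')
print(with_prefix)
print(any_parens)
```

Since every piece returned by the split is either appended or concatenated, joining the result always gives back the original string.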
sciroccorics