Is there a way to have python regex find the next instance of a character instead of the last instance?

Question

I'm a noob with RegEx patterns, so I'll apologize right away. :) I'm trying to take this string (note that there are some nested parentheses):

"(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)"

And remove all the parenthetical statements which begin with "(See §" or "(see §" in order to get the following result:

"(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)"

I've tried using .split() and re.sub(), but can't seem to find a good solution. Here's the closest I've gotten:

import re

txt = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

x = re.sub(r'\((s|S)ee §.*\)', r'', txt)

print(x)

It seems that the '.*\)' is finding the last instance of ')' as opposed to the next instance. Is there a way to override this behavior, or is there a better solution that I've missed entirely?

Thanks very much!

I think this will help you out: https://stackoverflow.com/questions/2013124/regex-matching-up-to-the-first-occurrence-of-a-character — AvitanD, Jun 17 '22 at 17:44
I would suggest using a stack to track the closing parentheses; once you have the start and end of the part that you want to remove, you are all set. — BhusalC_Bipin, Jun 17 '22 at 17:52

DarkKnight · Accepted Answer · 2022-06-18T06:37:11.970

It's been said that this cannot (realistically) be achieved with a regular expression.
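To see why the obvious fix falls short, here is a minimal sketch (the shortened `txt` is an illustrative sample, not the original string). A lazy quantifier or a negated character class does make the regex stop at the next ')', but with nested parentheses the next ')' is the wrong one:

```python
import re

# Illustrative, shortened sample with one nested "(PN)" group
txt = "Test text (see § 123.1 of this Proper Name (PN) subparagraph) and more (see § 123.2 of this subparagraph) text."

# Lazy quantifier: .*? stops at the NEXT ')' rather than the last one
lazy = re.sub(r"\([sS]ee §.*?\)", "", txt)

# A negated character class has the same effect here
negated = re.sub(r"\([sS]ee §[^)]*\)", "", txt)

# Both leave " subparagraph)" behind, because the first ')' found
# is the one that closes "(PN)", not the one that closes "(see § ...)"
print(lazy)
```

So the non-greedy match answers the question as asked, but it only gives the desired result when the removed parts contain no parentheses of their own.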

Here's a way that it could be done:

txt = "(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)"
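inparens = []  # (start, end) spans to remove, last span first
pstart = 0     # index of the '(' that opened the current top-level group
stack = 0      # current nesting depth

for i, c in enumerate(txt):
    if c == '(':
        if stack == 0:
            pstart = i
        stack += 1
    elif c == ')':
        if stack > 0:  # ignore an unmatched ')'
            if (stack := stack - 1) == 0:
                segment = txt[pstart:i+1]
                if segment.lower().startswith('(see §'):
                    inparens.insert(0, (pstart, i+1))
# Remove from the end backwards so that earlier indices stay valid
for s, e in inparens:
    txt = txt[:s].rstrip() + txt[e:]
print(txt)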

Output:

(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)

Essentially, this works through the string character by character, maintaining a "stack" (really a depth counter) to allow for segments with embedded parentheses. Once the stack returns to zero, i.e., the right parenthesis that closes a top-level segment has been seen, we know the bounds of the segment and can inspect it for relevance. The matching start/end tuples are collected in reverse order, and the string is finally reconstructed by cutting out the unwanted segments from the end backwards.

This is very probably not efficient, but it does seem to work.
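If the repeated slicing is a concern, the same idea can be sketched as a single pass that collects the pieces to keep and joins them once. This is only a sketch under the same assumptions as above; the function name `strip_see_refs` and its `prefix` parameter are hypothetical, introduced for illustration:

```python
def strip_see_refs(text, prefix="(see §"):
    """Remove top-level parenthesized segments starting with `prefix` (case-insensitive)."""
    kept = []        # slices of the text that survive
    depth = 0        # current nesting depth
    seg_start = 0    # index of the '(' that opened the current top-level segment
    copy_from = 0    # start of the text not yet copied into `kept`
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                seg_start = i
            depth += 1
        elif ch == ")" and depth > 0:  # an unmatched ')' is ignored
            depth -= 1
            if depth == 0 and text[seg_start:i + 1].lower().startswith(prefix):
                # keep everything before the segment, minus trailing whitespace
                kept.append(text[copy_from:seg_start].rstrip())
                copy_from = i + 1
    kept.append(text[copy_from:])
    return "".join(kept)


sample = "(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and more (including a parenthetical). (See § 125.3 for more.) (Hello World!)"
print(strip_see_refs(sample))
```

Each character is visited once and the result is built with a single join, so the work grows linearly with the length of the input.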

BhusalC_Bipin · Answer 2 · 2022-06-17T19:02:35.447

As I suggested in my comment, you can use a stack to track the start and end index of each part you want to remove. Once you have those indices, you can create a new string, as shown below:

sample_string = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

# track the start and end index of sub_string to be removed
stack = []
index_tracker = []

start = -1
for index, char in enumerate(sample_string):
    if char == '(':
        if sample_string[index+1: index+6].casefold() == 'see §'.casefold():
            start = index
        stack.append('(')
    elif char == ')':
        if stack:  # ignore an unmatched ')' instead of raising IndexError
            stack.pop()

    if not stack and start != -1:
        index_tracker.append((start, index))
        start = -1

# create a new string
result_string = ''
for index, (start, end) in enumerate(index_tracker):
    if index > 0:
        prev_end = index_tracker[index-1][1]
        result_string += sample_string[prev_end + 1: start - 1]
    elif index == 0:
        result_string += sample_string[: start - 1]
result_string += sample_string[index_tracker[-1][1] + 1: ]


print(result_string)

Output:

(i) Test text and additional test text along with more test text (including a parenthetical) to test the text. (Hello World!)

score 0 · Answer 3 · answered Jun 17 '22 at 19:43

Contrary to what most people are saying here, this problem can be solved with a regex if there are at most two levels of nested parentheses. That is, the deleted parts may contain a single level of parentheses, but those inner parentheses may not contain further parentheses themselves. I assume that this is the case for your text.

For example, the following string would not be allowed, because (Proper Name (PN)) sits inside the "(see § ...)" part, which makes three levels of nested parentheses:

(i) Test text (see § 123.1 of this (Proper Name (PN)) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)

Here's the code:

import re

txt = '(i) Test text (see § 123.1 of this Proper Name (PN) subparagraph) and additional test text (see § 123.2 of this subparagraph) along with more test text (including a parenthetical) to test the text. (See § 125.3 of this subparagraph for even more test text.) (Hello World!)'

x = re.sub(r"\((s|S)ee §([^()]*\([^()]*\))*[^()]*\) ?", r'', txt)

print(x)

I've also added " ?" (an optional space) at the end of the pattern, to prevent double spaces where text is removed. Remove this if you don't want that.
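For readers who find the one-line pattern hard to parse, here is a sketch of the same pattern written with `re.VERBOSE` and commented piece by piece. The two short strings are illustrative samples, not the original text; the second shows that a part with three levels of nesting is left untouched:

```python
import re

pattern = re.compile(r"""
    \( [sS]ee [ ] §               # literal "(see §" or "(See §"
    (?: [^()]* \( [^()]* \) )*    # any number of "text (inner group)" runs, one level deep
    [^()]*                        # trailing text without parentheses
    \)                            # the ')' that closes the "(see § ...)" part
    [ ]?                          # optional space, to avoid double spaces
""", re.VERBOSE)

two_levels = "Test text (see § 123.1 of this Proper Name (PN) subparagraph) and more text (including a parenthetical)."
three_levels = "Test text (see § 123.1 of this (Proper Name (PN)) subparagraph) and more text."

print(pattern.sub("", two_levels))    # the "(see § ...)" part is removed
print(pattern.sub("", three_levels))  # no match, so the string is unchanged
```

In verbose mode, unescaped whitespace in the pattern is ignored, which is why the literal spaces are written as `[ ]`.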

As said in the post, more than two levels of parentheses cannot be handled by this code. As this doesn't occur in most human-written posts, I assumed that the OP's texts wouldn't contain it either. I also don't see your problem: the code doesn't remove anything from your text, as expected. — The_spider, Jun 17 '22 at 19:48
"As this doesn't occur in most human-written posts"? What is the basis for that statement? Your answer works with the sample text, but it's not flexible. What if the parentheses in the text are unbalanced? — DarkKnight, Jun 18 '22 at 06:40
The basis for this statement is my own experience. I've rarely seen a text in which this occurs. Unbalanced parentheses would be almost impossible for any code to handle, as there's no way to determine whether a closing parenthesis closes the part to delete or some other parenthesis within that part. As such, I assume that this is also not the case here. — The_spider, Jun 18 '22 at 07:09
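Since every approach in this thread assumes balanced parentheses, one option is to check that assumption before removing anything. The following is a minimal sketch; the helper name `parens_balanced` is hypothetical and not part of any answer above:

```python
def parens_balanced(text):
    """Return True if every '(' has a matching ')' and none closes too early."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing open
                return False
    return depth == 0      # anything still open means unbalanced


print(parens_balanced("(see § 123.1 of this Proper Name (PN) subparagraph)"))
print(parens_balanced("(see § 123.1 of this Proper Name (PN subparagraph)"))
```

A text that fails this check could then be flagged for manual review rather than processed automatically.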
