3

I'm trying to remove the characters between the parentheses and brackets based on the length of characters inside the parentheses and brackets.

Using this:

def remove_text_inside_brackets(text, brackets="()[]"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0  # keep it
                else:  # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)

I'm able to remove the characters between the parentheses and brackets, but I cannot figure out how to remove the characters based on the length of characters inside.

I want to remove the parentheses or brackets together with their contents when the enclosed text is 4 characters or fewer. When the enclosed text is longer than 4 characters, I want to remove only the parentheses or brackets and keep the text. Sample text:

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"

Output:

print(remove_text_inside_brackets(text))
This is a sentence. 

Desired Output:

This is a sentence. Once a day twice a day
Ailurophile
  • What would be the output of `"(ab[XX]c)"`? If you remove `[XX]` you'd have `(abc)` which violates the rule "I wanted to remove characters between the parentheses and brackets if the length <=4 with parentheses" and leaving it would also violate the same rule. – Ch3steR Dec 20 '21 at 06:32
  • I have a feeling this is not a use case ;) – mozway Dec 20 '21 at 06:46
  • @Ch3steR, You are right, it violates. But in that scenario, I want to remove `(ab[XX]c)`. But so far I don't have such cases in my text. – Ailurophile Dec 20 '21 at 06:52

5 Answers

3

You can use a simple regex with re.sub and a function as replacement to check the length of the match:

import re
# the match includes both delimiters, hence the limit of 4 + 2
out = re.sub(r'\(.*?\)|\[.*?\]',
             lambda m: '' if len(m.group()) <= (4+2) else m.group()[1:-1],
             text)

Output:

'This is a sentence.  Once a day twice a day '

This gives you the logic for more complex checks, in which case you might want to define a named function rather than a lambda.
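As a minimal sketch of the named-function variant mentioned above: the names `MAX_SHORT`, `strip_or_unwrap` and `clean` are made up for illustration, and the threshold of 4 is taken from the question. The callback measures the inner text rather than the whole match, so no `+2` adjustment is needed.

```python
import re

MAX_SHORT = 4  # assumed threshold, taken from the question


def strip_or_unwrap(match):
    """Replacement callback: drop short groups, unwrap longer ones."""
    inner = match.group()[1:-1]  # text without the two delimiters
    return '' if len(inner) <= MAX_SHORT else inner


def clean(text):
    # Non-greedy match of a (...) group or a [...] group
    return re.sub(r'\(.*?\)|\[.*?\]', strip_or_unwrap, text)


text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(repr(clean(text)))
```

A named function like this is easier to extend, for instance to strip whitespace from the inner text before measuring it.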

mozway
2

How about splitting on [ and then looking for ] in each piece, measuring the length of the text in front of it:

def remove_text_inside_brackets(string):
    # treat parentheses and brackets alike
    my_str = string.replace('(', '[').replace(')', ']')
    out = []
    for s in my_str.split('['):
        if ']' in s:
            # inner: text inside the brackets, rest: text after them
            inner, _, rest = s.partition(']')
            # keep the inner text only when it is longer than 4 characters
            out.append(inner + rest if len(inner) > 4 else rest)
        else:
            out.append(s)
    return ''.join(out).strip()

remove_text_inside_brackets(text)

Output:

'This is a sentence.  Once a day twice a day'
1

Someone will hopefully improve on this, but as an alternative, this nested regular expression can work:

import re
re.sub(r'\[([^\]]{5,})\]', r'\g<1>',
       re.sub(r'\(([^)]{5,})\)', r'\g<1>',
              re.sub(r'\[[^\]]{,4}\]', '',
                     re.sub(r'\([^)]{,4}\)', '', text))))

The output of this is slightly different from your expected output; note the extra spaces after the period and at the end of the line:

'This is a sentence.  Once a day twice a day '

It completely removes the text and its surrounding brackets when the length is 4 or shorter, while it replaces the match with just the inner text when the length is 5 or longer.

Note that nested brackets, e.g., ((some text) more text) or [(four)] may fail.

9769953
1

I would just use string.find, rather than go character by character. Too much state to track. Note that this will explode if there is an unmatched open paren or open bracket. That's not hard to catch.

text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"

def remove_text_inside_brackets(text):
    i = 0
    while i >= 0:
        # Try for parens.
        i = text.find('(')
        j = text.find(')')
        if i < 0:
            # No parens, try for brackets.
            i = text.find('[')
            j = text.find(']')
        if i >= 0:
            if j-i > 5:
                text = text[:i] + text[i+1:j] + text[j+1:]
            else:
                text = text[:i] + text[j+1:]
    return text

print(remove_text_inside_brackets(text))
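
One possible way to catch the unmatched case is sketched below. This is a variant under assumptions, not the code above: the function name `remove_short_groups` and the `limit` parameter are made up for illustration, and it searches for the closer only after the opener, leaving the rest of the text untouched when no closer is found.

```python
def remove_short_groups(text, limit=4):
    """Drop (...) and [...] groups with at most `limit` inner characters;
    for longer groups, remove only the delimiters."""
    for open_ch, close_ch in (("(", ")"), ("[", "]")):
        start = 0
        while True:
            i = text.find(open_ch, start)
            if i < 0:
                break  # no more openers of this kind
            j = text.find(close_ch, i + 1)
            if j < 0:
                break  # unmatched opener: leave the remainder as it is
            inner = text[i + 1:j]
            if len(inner) > limit:
                # keep the inner text, drop the two delimiters
                text = text[:i] + inner + text[j + 1:]
                start = i + len(inner)
            else:
                # drop the delimiters and the short inner text
                text = text[:i] + text[j + 1:]
                start = i
    return text


sample = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(repr(remove_short_groups(sample)))
```

Like the original, this handles all parentheses before any brackets and does not attempt to deal with nested groups.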
Tim Roberts
1

We can use regular expressions to solve this:

import re
text = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
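# remove short groups (1 to 4 letters) together with their delimiters
text = re.sub(r'(\(|\[)[a-zA-Z]{1,4}(\)|\])', '', text)
# remove any remaining delimiters, keeping the text between them
print(re.sub(r'\[|\]|\(|\)', '', text))
output: "This is a sentence.  Once a day twice a day " (with a trailing space)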

Here the first regular expression matches 1 to 4 letters inside parentheses or brackets, together with the delimiters themselves. You can extend the character class to match digits and other special characters too.

Dharman
Santhosh Reddy
  • This would match a parenthesis with a square bracket, e.g. "(text]" would be removed. Perhaps that is intended, but it's not clear from the question. – 9769953 Dec 20 '21 at 08:27
  • Also, it only matches 52 letters. A string like "[be4]" will not be removed. – 9769953 Dec 20 '21 at 08:28
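
The two comments above could be addressed along the following lines. This is only a sketch, assuming that mismatched pairs such as "(text]" should not count as a group and that any character other than a delimiter may appear inside one; the names `PATTERN_SHORT`, `PATTERN_DELIMS` and `clean_strict` are hypothetical.

```python
import re

# Each alternative pairs an opener with its own closer, so "(text]" is not
# treated as a group. The negated classes accept digits and punctuation too.
PATTERN_SHORT = re.compile(r'\([^()]{1,4}\)|\[[^\[\]]{1,4}\]')
PATTERN_DELIMS = re.compile(r'[()\[\]]')


def clean_strict(text):
    # First pass: remove short groups together with their delimiters
    text = PATTERN_SHORT.sub('', text)
    # Second pass: remove the remaining delimiters, keeping the inner text
    return PATTERN_DELIMS.sub('', text)


sample = "This is a sentence. (RMVE) (Once a day) [twice a day] [RMV]"
print(repr(clean_strict(sample)))
```

With this variant "[be4]" is removed as a short group, while "(text]" only loses its delimiters in the second pass.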