How can I remove text within parentheses with a regex in Python?

Question

I am trying to remove text within parentheses using the function below, but it is not working.

How can I solve my problem?

import re

def clean_text(text):
    pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'  # e-mail addresses
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'  # URLs
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'([ㄱ-ㅎㅏ-ㅣ]+)'  # standalone Hangul jamo
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'<[^>]*>'  # HTML tags
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'[^\w\s]'  # punctuation
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'\([^)]*\)'  # parenthesized text: not working!!
    text = re.sub(pattern=pattern, repl='', string=text)
    return text

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. aaa@goggle.comㅋㅋ<H1>thank you</H1>'
clean_text(text)

The result is abc_def 좋은글 이것도 지워조 감사합니다 thank you

My goal is 좋은글 감사합니다 thank you

Your question and the expected value don't really match. How do you want `text` to be cleaned up? Please update your "goal". - abdusco, Jul 22 '19 at 11:58
Swap the last two `re.sub` calls. First, use `text = re.sub(pattern=r'\([^)]*\)', repl='', string=text)` and then the `'[^\w\s]'` regex replacement. - Wiktor Stribiżew, Jul 22 '19 at 12:09

score 1 · Answer 1 · answered Jul 22 '19 at 12:14

Try this:

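    import re

    def clean_text(text):
        # Remove e-mail addresses.
        pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove URLs.
        pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove standalone Hangul jamo such as ㅋㅋ.
        pattern = r'([ㄱ-ㅎㅏ-ㅣ]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove HTML tags.
        pattern = r'<[^>]*>'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove parenthesized text and the whitespace character after it.
        # This runs before the punctuation step, which would delete the parentheses.
        pattern = r'\([^)]*\)\s'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove punctuation (the '+' inside the class means plus signs are kept).
        pattern = r'[^\w\s+]'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Collapse runs of two or more whitespace characters into one space.
        pattern = r'\s{2,}'
        text = re.sub(pattern=pattern, repl=' ', string=text)
        return text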

The result will be exactly `좋은글 감사합니다 thank you`.
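One caveat worth checking: the pattern `\([^)]*\)\s` requires a whitespace character after the closing parenthesis, so a parenthesized group at the very end of the string is left in place. The sketch below illustrates this with a hypothetical input (`sample` is not from the question) and compares it with a variant that consumes the whitespace before the group instead.

```python
import re

# Pattern from the answer above: needs one whitespace character after ')'.
with_trailing_space = r'\([^)]*\)\s'
# Variant that removes optional whitespace before '(' instead.
with_leading_space = r'\s*\([^)]*\)'

# Hypothetical input whose parenthesized group ends the string.
sample = 'thank you (remove me)'

kept = re.sub(with_trailing_space, '', sample)
removed = re.sub(with_leading_space, '', sample)

print(repr(kept))     # 'thank you (remove me)'
print(repr(removed))  # 'thank you'
```

Which variant fits better depends on whether the input can end with a parenthesized group.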

score 1 · Accepted Answer · edited Jul 22 '19 at 12:30

Your `[^\w\s]` substitution removes the parentheses before the last `re.sub` runs, so the last regex has nothing left to match. You may swap the last two `re.sub` calls and use

import re

def clean_text(text):
    # Remove e-mail addresses.
    pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove URLs.
    pattern = r'(?:http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove standalone Hangul jamo such as ㅋㅋ.
    pattern = r'[ㄱ-ㅎㅏ-ㅣ]+'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove HTML tags.
    pattern = r'<[^>]*>'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove parenthesized text together with any whitespace before it.
    pattern = r'\s*\([^)]*\)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove punctuation last, so the parentheses are still there for the step above.
    pattern = r'[^\w\s]'
    text = re.sub(pattern=pattern, repl='', string=text)
    return text.strip()

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. aaa@goggle.comㅋㅋ<H1>thank you</H1>'
print(clean_text(text))
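To see why the order of the two substitutions matters, here is a minimal sketch that applies them in both orders. The names `PARENS`, `PUNCT` and the `sample` string are illustrative assumptions, not part of the original code.

```python
import re

PARENS = r'\s*\([^)]*\)'  # parenthesized text plus any whitespace before it
PUNCT = r'[^\w\s]'        # anything that is neither a word character nor whitespace

# Hypothetical input, shorter than the one in the question.
sample = '(abc_def) hello! (drop this) thanks.'

# Punctuation first: '(' and ')' are deleted, so PARENS no longer matches anything.
punct_first = re.sub(PARENS, '', re.sub(PUNCT, '', sample)).strip()

# Parentheses first: whole groups are removed before punctuation is stripped.
parens_first = re.sub(PUNCT, '', re.sub(PARENS, '', sample)).strip()

print(punct_first)   # abc_def hello drop this thanks
print(parens_first)  # hello thanks
```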


I suggest using raw string literals (note the `r''` prefixes), so that backslashes reach the regex engine unchanged, and stripping leading and trailing whitespace with `text.strip()`. The `\s*` in `r'\s*\([^)]*\)'` removes zero or more whitespace characters before each opening parenthesis.
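Note that `text.strip()` only trims the ends of the string. For the sample text, removing the URL leaves two adjacent spaces in the middle of the result. If single spacing matters, one option is to collapse whitespace runs as a final step, similar to the `\s{2,}` substitution in the other answer. The helper name `collapse_whitespace` below is hypothetical, and this is only a sketch.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

# The string below is what the substitutions above produce before strip().
collapsed = collapse_whitespace(' 좋은글  감사합니다 thank you')
print(collapsed)  # 좋은글 감사합니다 thank you
```

Unlike `\s{2,}`, the `\s+` pattern also turns single tabs and newlines into spaces, which may or may not be wanted.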
