1

Given a list as follows:

l = ['hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD', 
    'Car board price (tax included): JT Port', 
    'Ex-factory price (low-end price): Triethanolamine (85% commercial grade): North'
    ]

I would like to get the expected result as follows:

['hydrogenated benzene: SD', 'Car board price: JT Port', 'Ex-factory price: Triethanolamine: North']

With code below:

import re

def remove_extra(content):
    pat1 = r'[\s]'    # matches any whitespace character (not applied below)
    pat2 = r'\(.*\)'  # greedy: matches from the first '(' to the last ')'
    return re.sub(pat2, '', str(content))
[remove_extra(item) for item in l]

It generates:

['hydrogenated benzene : SD',
 'Car board price : JT Port',
 'Ex-factory price : North']

As you may notice, the last element of the result, 'Ex-factory price : North', is not as expected. How can I achieve what I need? Thanks.

ah bon
  • 1
    Is it possible to use `a = [re.sub(r'\((?:[^)(]|\([^)(]*\))*\)', '', str(item)) for item in l]`? Is it necessary to remove the space before `:`? – jezrael Aug 30 '21 at 07:04
  • That looks right from the result. Do you mean using `'\((?:[^)(]|\([^)(]*\))*\)'` as `pat1` and `[\s]` as `pat2`? – ah bon Aug 30 '21 at 07:08
  • 1
    I tested `[\s]` and it removes all spaces, which does not seem to be what you need. – jezrael Aug 30 '21 at 07:09

3 Answers

2

You can modify the linked solution by adding `\s*` to also remove any optional spaces before `(`:

import re

# https://stackoverflow.com/a/37538815/2901002
def remove_text_between_parens(text):
    n = 1  # run at least once
    while n:
        # remove innermost (non-nested) groups together with any spaces before them
        text, n = re.subn(r'\s*\([^()]*\)', '', text)
    return text

a = [remove_text_between_parens(item) for item in l]
print(a)

['hydrogenated benzene: SD', 
 'Car board price: JT Port', 
 'Ex-factory price: Triethanolamine: North']
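
To make the role of the loop concrete, here is a small sketch that records the string after each pass on the nested first item. The helper name `trace_passes` is made up for illustration; the pattern is the same one used above.

```python
import re

# Innermost (non-nested) parenthesized group plus any spaces before it.
PAREN = re.compile(r'\s*\([^()]*\)')

def trace_passes(text):
    """Return the string after every pass that removed at least one group."""
    steps = []
    while True:
        text, n = PAREN.subn('', text)
        if n == 0:  # nothing left to remove
            break
        steps.append(text)
    return steps

s = 'hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD'
for step in trace_passes(s):
    print(step)
# hydrogenated benzene (purity: 99.9 density, produced in ZB): SD
# hydrogenated benzene: SD
```

The first pass can only remove `(g/cm3)`, because the outer group still contains a parenthesis; once the inner group is gone, the second pass removes the outer one.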
jezrael
1

The nested parentheses make this complicated. The solution below works for your sample, but might not work for your whole dataset. Please update the question if you encounter an error so we can find a solution.

This function first counts how many separate top-level parenthesized groups exist in the string and then removes them one by one.

import re
import pandas as pd

def par_remover(st):
    # positions of every '(' and ')' in the string
    begin = [i.start() for i in re.finditer(r'\(', st)]
    end = [i.start() for i in re.finditer(r'\)', st)]
    # number of top-level groups to remove
    count = len(list(re.finditer(r'\(', st))) + 1 - len([i for i in begin if i < end[0]])
    for i in range(count):
        begin = [i.start() for i in re.finditer(r'\(', st)]
        end = [i.start() for i in re.finditer(r'\)', st)]
        # how many '(' open before the first ')'
        end1 = len([i for i in begin if i < end[0]])
        # slice from the first '(' to the ')' that closes it
        str_remove = st[st.find("("):list(re.finditer(r'\)', st))[end1 - 1].end()]
        st = st.replace(str_remove, '')
    return st.replace(')', '')

df = pd.DataFrame({'value': l})

df['value'] = df['value'].apply(par_remover)

result:

|    | value                                      |
|---:|:-------------------------------------------|
|  0 | hydrogenated benzene : SD                  |
|  1 | Car board price : JT Port                  |
|  2 | Ex-factory price : Triethanolamine : North |
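
As an alternative to locating parentheses by index, a single left-to-right scan with a depth counter also handles arbitrary nesting. This is only a sketch: the name `strip_parens` is hypothetical, and the final step that drops whitespace before `:` is an assumption based on the expected output in the question.

```python
import re

def strip_parens(text):
    """Drop everything inside balanced parentheses, at any nesting depth."""
    out = []
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1              # entering a group
        elif ch == ')' and depth > 0:
            depth -= 1              # leaving a group
        elif depth == 0:
            out.append(ch)          # keep characters outside groups (a stray ')' is kept too)
    # Assumption: whitespace left in front of ':' is unwanted.
    return re.sub(r'\s+(?=:)', '', ''.join(out))

l = ['hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD',
     'Car board price (tax included): JT Port',
     'Ex-factory price (low-end price): Triethanolamine (85% commercial grade): North']

print([strip_parens(item) for item in l])
# ['hydrogenated benzene: SD', 'Car board price: JT Port', 'Ex-factory price: Triethanolamine: North']
```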
Babak Fi Foo
  • Thanks, but the expected result for last element of list should be `Ex-factory price: Triethanolamine: North`, instead of `Ex-factory price: North` – ah bon Aug 30 '21 at 06:50
1

The problem is not really your third item but the first one, because it contains nested parentheses. You should loop like this and use `subn` instead of `sub`:

def remove_text_between_parens(text):
    n = 1
    while n:
        text, n = re.subn(r'\s*\([^()]*\)\s*', '', text)
    return text
>>> [remove_text_between_parens(t) for t in l]
['hydrogenated benzene: SD',
 'Car board price: JT Port',
 'Ex-factory price: Triethanolamine: North']

The right explanation is here: https://stackoverflow.com/a/37538815/15239951
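
For comparison, the following sketch shows what the simpler patterns do on the sample strings: the greedy `.*` swallows everything between the first `(` and the last `)`, while the lazy `.*?` stops at the first `)` and so breaks on nested groups. The variable names are illustrative only.

```python
import re

third = 'Ex-factory price (low-end price): Triethanolamine (85% commercial grade): North'
first = 'hydrogenated benzene (purity: 99.9 density (g/cm3), produced in ZB): SD'

# Greedy: one match spans from the first '(' to the last ')'.
greedy = re.sub(r'\s*\(.*\)', '', third)
print(greedy)  # Ex-factory price: North

# Lazy: each match ends at the first ')', which cuts a nested group in half.
lazy = re.sub(r'\s*\(.*?\)', '', first)
print(lazy)    # hydrogenated benzene, produced in ZB): SD
```

Neither variant handles both cases, which is why the loop over `[^()]*` (innermost groups only) is used instead.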

Corralien