3

I have the following dataframe as an example:

import pandas as pd

test = pd.DataFrame({'type': ['fruit-of the-loom (sometimes-never)', 'yes', 'ok (not-possible) I will try', 'vegetable', 'poultry', 'poultry'],
                     'item': ['apple', 'orange', 'spinach', 'potato', 'chicken', 'turkey']})

I found many posts from people wanting to remove parentheses from strings, or similar situations. In my case I would like to keep the string exactly as it is, except for removing any hyphen that sits inside the parentheses.

Does anyone have a suggestion on how I could achieve this?

str.split() would take care of the hyphen if it were leading and str.rsplit() if it were trailing, but I can't think of a way to apply either of them here.

In this case, the ideal outcome for the values in this hypothetical column would be:

'fruit-of the-loom (sometimes never)',
'yes', 
'ok (not possible) I will try', 
'vegetable', 
'poultry', 
'poultry'

Adam
bls

3 Answers

2

One way could be to use str.replace with a pattern that matches whatever is between parentheses, and pass a lambda as the repl parameter that calls replace on the text of each match object:

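# regex=True is needed on pandas 2.0+, where str.replace treats pat as a literal by default
print(test['type'].str.replace(pat=r'\((.*?)\)',
                               repl=lambda x: x.group(0).replace('-', ' '),
                               regex=True))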
0    fruit-of the-loom (sometimes never)
1                                    yes
2           ok (not possible) I will try
3                              vegetable
4                                poultry
5                                poultry
Name: type, dtype: object

Explanation of what is in pat= can be found here
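To see what the pattern does outside pandas, here is a minimal sketch using the standard re module. The helper name dehyphenate_parens and the second sample string are made up for illustration; the pattern itself is the one used above.

```python
import re

# \( and \) match literal parentheses; .*? is lazy, so it stops at the
# first closing parenthesis and each (...) group is matched on its own.
PAREN_GROUP = re.compile(r'\((.*?)\)')

def dehyphenate_parens(text):
    """Replace hyphens with spaces, but only inside (...) groups."""
    return PAREN_GROUP.sub(lambda m: m.group(0).replace('-', ' '), text)

print(dehyphenate_parens('fruit-of the-loom (sometimes-never)'))
print(dehyphenate_parens('a-b (c-d) e-f (g-h-i)'))
```

Because the replacement runs once per match, this should also cope with several sets of parentheses in one string. Hyphens outside the parentheses are left alone.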

Ben.T
1
test.type = (test.type.str.extract(r'(.*?\(.*?)-(.*?\))(.*)')
             .sum(1)
             .combine_first(test.type))

Explanation:

  • Extract three regex groups: everything from the beginning up to the hyphen inside the parentheses, everything after that hyphen up to the closing parenthesis, and any optional trailing text
  • Concatenate them together again with sum
  • Where the result is NaN (no match), use the values from the original column (combine_first)
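As a rough illustration of the three groups, the same pattern can be tried on single strings with re.search (str.extract likewise takes the groups from the first match only). The helper name split_groups is hypothetical and only here for the demonstration.

```python
import re

# Group 1: start of string up to the hyphen inside the parentheses
# Group 2: text after that hyphen up to and including ')'
# Group 3: whatever follows the closing parenthesis (may be empty)
PATTERN = re.compile(r'(.*?\(.*?)-(.*?\))(.*)')

def split_groups(text):
    """Return the three captured groups, or None when nothing matches."""
    m = PATTERN.search(text)
    return m.groups() if m else None

print(split_groups('ok (not-possible) I will try'))
print(split_groups('yes'))
```

The strings without parentheses produce no match, which is why the pandas version needs combine_first to fall back to the original values.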

This way the hyphen is dropped, not replaced by a space. If you need a space, you could concatenate the extracted groups explicitly instead of using sum, so that rows without a match stay NaN and are still filled by combine_first:

parts = test.type.str.extract(r'(.*?\(.*?)-(.*?\))(.*)')
test.type = (parts[0] + ' ' + parts[1] + parts[2]).combine_first(test.type)

Either way, this won't work for more than one set of parentheses.

Josh Friedlander
0

I should have taken a little longer to think about this one.

This is the solution I came up with:

Count the parentheses, and replace hyphens only while the count is positive, i.e. while inside an open parenthesis.

def inside_parens(string):
    parens_count = 0
    return_string = ""
    for a in string:
        if a == "(":
            parens_count += 1
        elif a == ")":
            parens_count -= 1
        if parens_count > 0:
            return_string += a.replace('-', ' ')
        else:
            return_string += a
    return return_string

Once this is done apply it to the intended column:

df['col_1'] = df['col_1'].apply(inside_parens)

If you want to generalize the function you can actually just pass what you want to replace and make it more versatile.
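A possible sketch of that generalization, assuming single-character old values. The name replace_inside_parens and the parameters old and new are my own choices, not anything from pandas:

```python
def replace_inside_parens(string, old='-', new=' '):
    """Replace the character `old` with `new` only inside parentheses."""
    depth = 0  # how many parentheses are currently open
    out = []
    for ch in string:
        if ch == '(':
            depth += 1
        elif ch == ')':
            # Assumption: a stray ')' should not push the count below zero
            depth = max(depth - 1, 0)
        out.append(new if depth > 0 and ch == old else ch)
    return ''.join(out)

print(replace_inside_parens('ok (not-possible) I will try'))
print(replace_inside_parens('a_b (c_d (e_f) g_h) i_j', old='_', new='-'))
```

It can be applied to a column the same way as before, e.g. df['col_1'].apply(replace_inside_parens). Collecting the characters in a list and joining once avoids repeated string concatenation, and nested parentheses are handled by the counter.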

bls