I have a column in a dataframe with strings like "Boris", and others with extra text in parentheses, like "Igor (king)". I just want a column with Boris / Igor / ... (the parentheses and everything between them deleted). I tried this:

pattern = '(^[\w]*)(?:[w]* \()'
Test =df['column'].str.extract(pattern)

I only get back the names that have extra text in parentheses: I get NaN / Igor / NaN.

Any help?

mujjiga
Hervé Anv

2 Answers

import re
import pandas as pd
df = pd.DataFrame({'name': ['Boris', 'Igor (King)', "Jack (prince of Persia)"]})
df['name'] = df['name'].apply(lambda x: re.sub(r"\(.*\)", "", x).strip())

Output:

    name
0   Boris
1   Igor
2   Jack
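
The same idea can also be written without apply, using the vectorized string accessor. This is only a sketch: the variable name cleaned is made up for illustration, and the pattern assumes each value contains at most one parenthesized group with no nesting.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Boris', 'Igor (King)', 'Jack (prince of Persia)']})

# Assumption: at most one, non-nested "(...)" group per value.
# regex=True is passed explicitly so the pattern is never treated as a literal string.
cleaned = df['name'].str.replace(r"\s*\([^()]*\)", "", regex=True).str.strip()

print(cleaned.tolist())  # ['Boris', 'Igor', 'Jack']
```

The leading \s* removes the space before the opening parenthesis, so the final strip only matters for stray whitespace at the ends.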
mujjiga
  • Thanks mujjiga. I still have an issue with strings like "Jack (prince of Persia)": nothing is replaced. I still have "Jack (prince of Persia)" – Hervé Anv Oct 04 '20 at 14:55
  • Maybe replace everything in between and including `()`. Updated the answer – mujjiga Oct 04 '20 at 15:58

If you want to keep the first word and remove the following contents between the parentheses, you have to extend your pattern to match up to the closing parenthesis.

You could use str.replace and use capture group 1 in the replacement.

^(\w+) \([^()]+\)

Explanation

  • ^ Start of string
  • (\w+) Capture group 1, match 1+ word characters. It is followed by a literal space in the pattern; use \s+ there instead to match 1+ whitespace characters
  • \([^()]+\) Match from ( till ) using a negated character class matching any character except ( or )
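
To see how the pieces behave on their own, here is a small sketch using only Python's re module on plain strings. The names pattern, samples, and cleaned are illustrative, and the "Anna Maria (queen)" input is a made-up example added to show a limitation.

```python
import re

# Same pattern as above: first word captured, then a space, then "(...)".
pattern = re.compile(r"^(\w+) \([^()]+\)")

samples = ["Boris", "Igor (King)", "Jack (prince of Persia)"]

# Strings without a match (like "Boris") are returned unchanged by sub().
cleaned = [pattern.sub(r"\1", s) for s in samples]
print(cleaned)  # ['Boris', 'Igor', 'Jack']

# Limitation: the pattern expects exactly one word before the parenthesis,
# so a two-word name is left untouched.
two_words = pattern.sub(r"\1", "Anna Maria (queen)")
print(two_words)  # Anna Maria (queen)
```

If names can contain several words, the capture group would need to be widened, for example to something like ([^()]+?) followed by \s*, which is a different pattern from the one explained here.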

For example

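import pandas as pd

df = pd.DataFrame({'column': ['Boris', 'Igor (King)', 'Jack (prince of Persia)']})
# regex=True is needed on newer pandas versions, where str.replace matches literally by default
result = df['column'].str.replace(r"^(\w+) \([^()]+\)", r"\1", regex=True)
print(result)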

Output

0    Boris
1     Igor
2     Jack
The fourth bird