Python - regex to split a column in 2 in a dataframe

Question

I have a DataFrame column containing strings like "Boris", and others with extra text in parentheses, like "Igor (king)". I just want a column with Boris / Igor / ... (everything in parentheses removed). I tried this:

pattern = '(^[\w]*)(?:[w]* \()'
Test =df['column'].str.extract(pattern)

I only get back the names that have extra text in parentheses: I get NaN / Igor / NaN.

Any help?

Please update the question with some sample rows from the DataFrame so we can debug the regex. — S3DEV, Oct 04 '20 at 14:03
use re.sub with `\([^()]+\)` or use `(^\w+) \([^()]+\)` and replace with group 1 https://regex101.com/r/7cZq00/1 — The fourth bird, Oct 04 '20 at 14:58

mujjiga · Answer 1 · 2020-10-04T15:59:14.850


import re
import pandas as pd
df = pd.DataFrame({'name': ['Boris', 'Igor (King)', "Jack (prince of Persia)"]})
df['name'] = df['name'].apply(lambda x: re.sub(r"\(.*\)", "", x).strip())

Output:

    name
0   Boris
1   Igor
2   Jack
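
One caveat worth illustrating: `.*` is greedy, so `\(.*\)` removes everything from the first `(` to the last `)` in a string. The sketch below is not part of the original answer; the input string is a hypothetical example with two parenthesized groups, and the second pattern borrows the negated character class `[^()]` suggested in the comments on the question.

```python
import re

# Hypothetical input with two parenthesized groups (not from the original question).
text = "Igor (King) the (Great)"

# Greedy: .* runs from the first "(" to the last ")", so " the " is removed too.
greedy = re.sub(r"\(.*\)", "", text).strip()

# Negated class: each (...) group is removed on its own,
# together with any whitespace directly before it.
per_group = re.sub(r"\s*\([^()]*\)", "", text).strip()

print(repr(greedy))
print(repr(per_group))
```

For the sample rows in the question, which contain a single group each, both patterns should give the same result.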

edited Oct 04 '20 at 15:59

answered Oct 04 '20 at 14:07


Thanks mujjiga. I still have an issue with strings like "Jack (prince of Persia)": nothing is replaced, and I still get "Jack (prince of Persia)". – Hervé Anv Oct 04 '20 at 14:55
Maybe replace everything between and including the `()`. I updated the answer. – mujjiga Oct 04 '20 at 15:58

The fourth bird · Answer 2 · 2020-10-04T15:39:45.573

If you want to keep the first word and remove the following content between the parentheses, you have to extend your pattern to match up to the closing parenthesis.

You could use str.replace and use capture group 1 in the replacement.

^(\w+) \([^()]+\)

Explanation

^ Start of string
(\w+) Capture group 1, match 1+ word characters. The group is followed by a single literal space; use \s+ there instead to match 1+ whitespace characters
\([^()]+\) Match from ( till ) using a negated character class matching any character except ( or )
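
To make the difference between the literal space and `\s+` concrete, here is a minimal sketch using only the standard `re` module. The variable names are illustrative, and the third sample (with several spaces before the parenthesis) is a hypothetical input that does not appear in the question.

```python
import re

# Pattern from the answer: exactly one literal space between the word and "(".
single_space = re.compile(r"^(\w+) \([^()]+\)")

# Variant from the explanation: \s+ accepts one or more whitespace characters.
any_space = re.compile(r"^(\w+)\s+\([^()]+\)")

# The third sample is hypothetical: it has three spaces before "(".
samples = ["Boris", "Igor (King)", "Jack   (prince of Persia)"]

# Replace the whole match with capture group 1 (the leading word).
cleaned_single = [single_space.sub(r"\1", s) for s in samples]
cleaned_any = [any_space.sub(r"\1", s) for s in samples]

print(cleaned_single)
print(cleaned_any)
```

Strings that do not match a pattern are returned unchanged, which is why "Boris" survives in both lists and why the single-space pattern leaves the third sample untouched.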

Regex demo

For example

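import pandas as pd
df = pd.DataFrame({'column': ['Boris', 'Igor (King)', 'Jack (prince of Persia)']})
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df = df['column'].str.replace(r"^(\w+) \([^()]+\)", r"\1", regex=True)
print(df)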

Output

0    Boris
1     Igor
2     Jack
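
As a side note on the original attempt, the NaN values likely come from the pattern requiring " (" after the name, so rows without parentheses never match. The sketch below is an assumption-laden alternative rather than part of this answer: it keeps `str.extract` but captures only the leading word characters, and the variable names `original` and `names` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column': ['Boris', 'Igor (King)', 'Jack (prince of Persia)']})

# The asker's pattern: the non-capturing group demands " (" after the name,
# so a row without parentheses fails to match and yields NaN.
original = df['column'].str.extract(r'(^[\w]*)(?:[w]* \()', expand=False)

# Capturing only the leading word characters matches every row.
names = df['column'].str.extract(r'^(\w+)', expand=False)

print(original)
print(names)
```

This only works when the wanted name is a single run of word characters at the start of the string; names containing spaces would need the `str.replace` approach instead.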
