Using regex:
df[['movie_title', 'year']] = df.title.str.extract(r'(.*)\s\((\d+)', expand=True)
Sample Data:
import pandas as pd

df = pd.DataFrame({'title':
['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
'Waiting to Exhale (1995)', 'Father of the Bride Part II (1995)',
'Hello (Goodbye) (1995)'
]})
# Raw string so that \s, \( and \d reach the regex engine unchanged
df[['movie_title', 'year']] = df.title.str.extract(r'(.*)\s\((\d+)', expand=True)
Output:
                                title                  movie_title  year
0                    Toy Story (1995)                    Toy Story  1995
1                      Jumanji (1995)                      Jumanji  1995
2             Grumpier Old Men (1995)             Grumpier Old Men  1995
3            Waiting to Exhale (1995)            Waiting to Exhale  1995
4  Father of the Bride Part II (1995)  Father of the Bride Part II  1995
5              Hello (Goodbye) (1995)              Hello (Goodbye)  1995
The regular expression we use is '(.*)\s\((\d+)'.
The first part of the expression is (.*). The parentheses mark a capturing group, and what is inside them specifies what we want to capture. In this case, .* greedily matches as much as it can. After the capturing group we have \s\(, which matches a whitespace character followed by a literal open parenthesis, so ' (' in our titles. Because the capturing group before this is greedy, it captures everything up to the final ' (' that is followed by digits, even when a title contains several parentheses (for instance, see how Hello (Goodbye) (1995) is captured properly).
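To see what the greedy group buys you, here is a small sketch using the standard `re` module instead of pandas. The title `Some Movie (2000) (1995)` is a made-up example (not from the sample data), and the lazy pattern `(.*?)` is only there for contrast. When a title contains more than one parenthesized number, the greedy version splits at the last one and the lazy version splits at the first:

```python
import re

# Hypothetical title with two parenthesized numbers, used only for illustration
title = "Some Movie (2000) (1995)"

# Greedy group: .* grabs as much as possible, so the split happens at the LAST " (digits"
greedy = re.search(r"(.*)\s\((\d+)", title)

# Lazy group (assumed variant, for contrast): .*? grabs as little as possible,
# so the split happens at the FIRST " (digits"
lazy = re.search(r"(.*?)\s\((\d+)", title)

print(greedy.groups())  # ('Some Movie (2000)', '1995')
print(lazy.groups())    # ('Some Movie', '2000')

# A title from the sample data: "(Goodbye)" is not numeric, so \d+ cannot match there
sample = re.search(r"(.*)\s\((\d+)", "Hello (Goodbye) (1995)")
print(sample.groups())  # ('Hello (Goodbye)', '1995')
```

For titles like those in the sample data, where only the year is numeric, both variants would give the same result; the greedy form is simply the safer default when the year is always last.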
Finally, we add a second capturing group with \d+ as the thing we want to capture. It captures the run of digits after the final ' (' in your title, which for us will be the four-digit year.
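One thing worth knowing is that `str.extract` returns the captured year as text, not as a number. If you want to sort or filter by year numerically, a minimal sketch (assuming every title matches the pattern, so there are no missing values) is to cast the column afterwards:

```python
import pandas as pd

# Two titles taken from the sample data above
df = pd.DataFrame({'title': ['Toy Story (1995)', 'Hello (Goodbye) (1995)']})

# Same pattern as in the answer, written as a raw string
df[['movie_title', 'year']] = df.title.str.extract(r'(.*)\s\((\d+)', expand=True)

# The captured year is a string such as '1995'; cast it to an integer.
# This assumes every row matched; rows that did not match would hold NaN
# and make the cast fail.
df['year'] = df['year'].astype(int)

print(df)
```

If some titles might lack a year, `pd.to_numeric(df['year'], errors='coerce')` is a more forgiving alternative to `astype(int)`.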