movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
In the line of code above, please explain what every character in the pattern .*\((.*)\).* does. How does it extract 1995 from Toy Story(1995)?
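What you are doing above only works because your year is inside (). For example, a title without parentheses returns NaN, while any text inside parentheses is returned whether or not it is a year: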
In [98]: pd.Series(["Toy Story 1995"]).str.extract('.*\((.*)\).*', expand=True)
Out[98]:
     0
0  NaN
In [99]: pd.Series(["Toy Story (test)"]).str.extract('.*\((.*)\).*', expand=True)
Out[99]:
      0
0  test
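The pattern captures whatever sits between parentheses. Each .* matches any run of characters (zero or more), each \ escapes the ( or ) that follows it so that it matches a literal parenthesis, and the inner, unescaped () define the capturing group, i.e. the part of the match that extract returns. For Toy Story(1995), the first .* matches Toy Story, \( matches the opening parenthesis, (.*) captures 1995, \) matches the closing parenthesis, and the final .* matches the empty remainder.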
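To see the pattern piece by piece, here is a minimal sketch that uses Python's re module directly instead of pandas, so it runs without a DataFrame; str.extract likewise returns the groups of the first match. The helper name extract_parenthesized is made up for illustration.

```python
import re

# Verbose form of the pattern from the question, one token per line.
PATTERN = re.compile(r"""
    .*      # any characters, as many as possible (greedy)
    \(      # a literal opening parenthesis
    (.*)    # capture group 1: the text between the parentheses
    \)      # a literal closing parenthesis
    .*      # whatever is left after the closing parenthesis
""", re.VERBOSE)


def extract_parenthesized(title):
    """Return the text captured by group 1, or None if nothing matches."""
    match = PATTERN.search(title)
    return match.group(1) if match else None


print(extract_parenthesized("Toy Story(1995)"))   # the year, as a string
print(extract_parenthesized("Toy Story 1995"))    # no parentheses, so None
print(extract_parenthesized("Toy Story (test)"))  # any text is captured
```

Because the leading .* is greedy, a title with several parenthesized parts yields the contents of the last pair.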
You probably want something like the line below. \d is shorthand for a digit, [0-9], and {4} requires exactly four of them in a row, so if you know the year format is yyyy you could do:
movies['title'].str.extract('(\d{4})')
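As a rough check of how the four-digit pattern behaves, the sketch below applies it with plain re.search, which takes the first match just as str.extract does. The function name first_year and the sample titles are assumptions for illustration, not part of the original data.

```python
import re

# Same pattern as in the str.extract call: four consecutive digits, captured.
YEAR = re.compile(r"(\d{4})")


def first_year(title):
    """Return the first run of four digits in title, or None if there is none."""
    match = YEAR.search(title)
    return match.group(1) if match else None


for title in ["Toy Story (1995)", "Toy Story 1995", "Toy Story (test)"]:
    print(title, "->", first_year(title))
```

Note that this returns the first four digits found anywhere in the title, so a title that itself contains a number may give the wrong result. If the year is always in parentheses, a stricter pattern such as \((\d{4})\) may be a safer choice.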
This thread has a more general example using \d+, which matches one or more digits.