movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)

In the line of code above, please explain what each character in the pattern .*\((.*)\).* does. How does it extract 1995 from Toy Story(1995)?

1 Answer


What you are doing above only works because your year is enclosed in parentheses. For example, a title without parentheses gives no match, while any text inside parentheses is captured:

In [98]: pd.Series(["Toy Story 1995"]).str.extract('.*\((.*)\).*', expand=True)                                                   
Out[98]: 
     0
0  NaN

In [99]: pd.Series(["Toy Story (test)"]).str.extract('.*\((.*)\).*', expand=True)                                                 
Out[99]: 
      0
0  test

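The pattern captures whatever sits between parentheses. Each .* matches any run of characters (zero or more), the backslashes in \( and \) escape the outer parentheses so that they match literal ( and ) characters, and the inner unescaped ( ) define the capturing group, i.e. the part of the match that extract returns. For Toy Story(1995), the first .* consumes Toy Story, \( matches the opening parenthesis, (.*) captures 1995, \) matches the closing parenthesis, and the final .* matches the empty remainder.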
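To see each piece of the pattern at work outside pandas, here is a minimal sketch using the standard re module. It assumes that str.extract takes its groups from the first match in each string, so re.search is a reasonable stand-in; the helper name year_text is made up for illustration.

```python
import re

# Same pattern as in the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r'.*\((.*)\).*')


def year_text(title):
    """Return the text captured between parentheses, or None if no match."""
    match = pattern.search(title)
    return match.group(1) if match else None


print(year_text("Toy Story(1995)"))   # 1995
print(year_text("Toy Story 1995"))    # None (no parentheses to match)
print(year_text("Toy Story (test)"))  # test (anything in parentheses is captured)
```

Note that the group captures any text in parentheses, not only digits, which is why (test) is returned as well.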

You probably want something like the line below. \d is shorthand for a digit ([0-9]) and {4} requires exactly four of them in a row, so if you know the year format is yyyy you could do:

movies['title'].str.extract(r'(\d{4})')
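One caveat with a bare four-digit pattern is that it picks up the first four digits anywhere in the title. The sketch below, again using plain re as a stand-in for str.extract, compares it with a variant that also requires the surrounding parentheses. The sample titles and the helper first_group are assumptions for illustration, not part of your data.

```python
import re

# Hypothetical sample titles, chosen to show where the two patterns differ.
titles = ["Toy Story (1995)", "Toy Story 1995", "2001: A Space Odyssey (1968)"]


def first_group(pattern, text):
    """Return group 1 of the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Any four consecutive digits, wherever they occur in the title.
any_four = [first_group(r'(\d{4})', t) for t in titles]

# Four digits that are wrapped in literal parentheses.
in_parens = [first_group(r'\((\d{4})\)', t) for t in titles]

print(any_four)   # ['1995', '1995', '2001']
print(in_parens)  # ['1995', None, '1968']
```

If your titles can contain other four-digit numbers, the stricter pattern is probably the safer choice; the same pattern string can be passed to str.extract.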

This thread has a more general example using \d+, which matches one or more digits.

RK1