Extraction of numeric data from a string

Question

movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)

In the above line of code, please explain the use of every character in .*\((.*)\).*. How does it extract 1995 from Toy Story(1995)?

That's a regular expression: https://stackoverflow.com/questions/4736/learning-regular-expressions — Willem Van Onsem, Oct 12 '19 at 13:11

RK1 · Answer 1 · 2019-10-12T14:29:56.263

What you are doing above only works because your year is inside (). For example, the below doesn't work:

In [98]: pd.Series(["Toy Story 1995"]).str.extract('.*\((.*)\).*', expand=True)                                                   
Out[98]: 
     0
0  NaN

In [99]: pd.Series(["Toy Story (test)"]).str.extract('.*\((.*)\).*', expand=True)                                                 
Out[99]: 
      0
0  test

The above is finding everything between the parentheses. The .* matches any run of characters, the \ escapes the outer ( and ) so that they are matched literally, and the inner () define the capturing group, i.e. the part of the match that gets returned.
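To make the piece-by-piece reading concrete, here is a minimal sketch using Python's built-in re module rather than pandas (str.extract takes the same regex syntax). The helper name extract_parenthesized is made up for illustration and is not part of either library.

```python
import re

# Same pattern as in the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r'.*\((.*)\).*')

# .*    any run of characters before the opening parenthesis (greedy)
# \(    a literal "("
# (.*)  capturing group: whatever sits between the parentheses
# \)    a literal ")"
# .*    any run of characters after the closing parenthesis


def extract_parenthesized(title):
    """Hypothetical helper: return the captured text, or None if no match."""
    match = pattern.search(title)
    return match.group(1) if match else None


print(extract_parenthesized("Toy Story(1995)"))   # 1995
print(extract_parenthesized("Toy Story (test)"))  # test
print(extract_parenthesized("Toy Story 1995"))    # None
```

Because the leading .* is greedy, a title with several parenthesized parts yields the last one, which is usually where the year sits.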

You probably want something like the below. \d is shorthand for a digit [0-9] and {4} specifies the expected length, so if you know the year format is yyyy you could do:

movies['title'].str.extract(r'(\d{4})')

This thread has a more general example using \d+
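As a rough comparison of \d{4} and \d+, here is a small sketch, again with the standard re module so it runs without pandas. The function names first_year and first_number are invented for this example.

```python
import re


def first_year(title):
    """Hypothetical helper: first run of exactly four digits, or None."""
    match = re.search(r'(\d{4})', title)
    return match.group(1) if match else None


def first_number(title):
    """Hypothetical helper: first run of one or more digits, or None."""
    match = re.search(r'(\d+)', title)
    return match.group(1) if match else None


print(first_year("Toy Story (1995)"))  # 1995
print(first_year("Toy Story 1995"))    # 1995, no parentheses needed
print(first_number("Se7en (1995)"))    # 7, the first digit run wins
```

Both patterns return the first match in the string, so a title that itself contains digits can shadow the year. If the year is always in parentheses, a pattern such as \((\d{4})\) combines the two ideas.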

