How to extract specific content in a pandas dataframe with a regex?

Question

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()


Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I would like to extract just the titles of the movies with a regular expression. I decided to use the following regex: \b([^\d\W]+)\b. So I tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

jezrael · Accepted Answer · 2016-03-16T07:51:13.467

58

You can try str.extract and strip, but str.split is better, because movie titles can contain numbers too. Another solution is to replace the content of the parentheses using a regex and then strip leading and trailing whitespace:

# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)

# note: this pattern also drops numbers that are part of a title
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesised part with a regex
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
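
As a side note on the NaN result in the question, one likely cause is that the pattern was written as a normal string literal, where Python turns \b into a backspace character before the regex engine ever sees it. The following is a minimal sketch under that assumption, using a small sample Series rather than the asker's actual data. It also shows that even with a raw string, this particular pattern only captures the first word of each title.

```python
import pandas as pd

# Small illustrative sample, not the asker's full dataset.
titles = pd.Series(["Toy Story (1995)", "GoldenEye (1995)"])

# In a normal (non-raw) string literal, "\b" is the backspace character (\x08),
# so this pattern searches for literal backspaces and matches nothing.
non_raw = titles.str.extract("\b([^\\d\\W]+)\b", expand=False)

# In a raw string, \b reaches the regex engine as a word boundary.
# The pattern now matches, but only the first run of letters.
raw = titles.str.extract(r"\b([^\d\W]+)\b", expand=False)

print(non_raw.isna().all())  # True
print(raw.tolist())          # ['Toy', 'GoldenEye']
```

This is why the split and replace approaches above, which remove the parenthesised part instead of matching the title itself, tend to be more robust here.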

edited Mar 16 '16 at 07:51

answered Mar 16 '16 at 07:38

jezrael


I got this: `TypeError: extract() got an unexpected keyword argument 'expand'` – tumbleweed Mar 16 '16 at 07:41

Did you update `pandas` to version `0.18.0`? Check it with `print pd.show_versions()` – jezrael Mar 16 '16 at 07:41
I updated and got this: `AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas`. Now I have: `byteorder: little LC_ALL: None LANG: None pandas: 0.18.0 nose: 1.3.7 pip: 8.1.0` – tumbleweed Mar 16 '16 at 07:50
Thanks for the help... just another issue: why, when `astype(str)` is used, do I get the following exception: `UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)`? Note that the encoding of the file is `encoding='iso-8859-1'`; I already set it when reading the pandas dataframe, yet I still get this exception. How should I deal with this encoding problem? – tumbleweed Mar 17 '16 at 07:17
Yes:`df = pd.read_csv('ml-100k/u.item', \ sep = '|',names = ['movie_id','movie_title','release_date', \ 'video_release_date', 'IMDb-URL','unknown','Action','Adventure',\ 'Animation', 'Childrens','Comedy','Crime','Documentary'\ ,'Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',\ 'Romance','Sci-Fi','Thriller', 'War' ,'Western'],encoding='iso-8859-1')` – tumbleweed Mar 17 '16 at 07:20
That is very interesting, because I think string columns are converted to `utf-8` in `read_csv` by the `encoding` parameter. Is it possible to share your file? – jezrael Mar 17 '16 at 07:31
Yes, I am just practicing with this [dataset](https://github.com/ryotat/mlss15/tree/master/python/datasets/ml-100k)... so what do you think is happening? – tumbleweed Mar 17 '16 at 07:35
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106551/discussion-between-jezrael-and-ml-student). – jezrael Mar 17 '16 at 07:36
How would I, say, extract only movie title and titles1 using pandas? – 324 Sep 12 '19 at 01:41

su79eu7k · Answer 2 · 2016-03-16T07:59:09.543

9

You should mark the text group(s) with () as shown below, so that only the specific part you want is captured.

new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']

pandas.core.strings.StringMethods.extract

StringMethods.extract(pat, flags=0, **kwargs)

Find groups in each string using passed regular expression
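
To illustrate how capture groups drive the result of str.extract, here is a minimal sketch on a two-row sample frame (the data and the group names `title` and `year` are chosen for illustration only). With a single group and expand=False the result is a Series; with several named groups the result is a DataFrame whose columns take the group names.

```python
import pandas as pd

# Illustrative sample data.
df = pd.DataFrame({"movie_title": ["Toy Story (1995)", "Four Rooms (1995)"]})

# One capture group with expand=False returns a Series:
# lazily take everything up to the first " (".
just_titles = df["movie_title"].str.extract(r"(.+?) \(", expand=False)

# Several named groups return a DataFrame, one column per group.
parts = df["movie_title"].str.extract(r"(?P<title>.+?) \((?P<year>\d{4})\)")

print(just_titles.tolist())  # ['Toy Story', 'Four Rooms']
print(list(parts.columns))   # ['title', 'year']
```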

edited Mar 16 '16 at 07:59

answered Mar 16 '16 at 07:19

su79eu7k


score 1 · Answer 3 · answered Apr 14 '21 at 17:49

1

I wanted to extract the text after the symbol "@" and before the symbol "." (period). I tried this, and it worked more or less: the result still includes the "@" symbol, which I do not want.

df['col'].astype(str).str.extract('(@.+.+)')
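
One possible way to drop the "@" from the result is to keep it outside the capture group and stop the group at the first period. This is only a sketch: the sample values below are hypothetical, since the actual contents of `col` are not shown.

```python
import pandas as pd

# Hypothetical sample data; the original column contents are not shown.
df = pd.DataFrame({"col": ["alice@example.com", "bob@mail.example.org"]})

# "@" stays outside the group, so it is matched but not captured.
# [^.]+ takes characters up to (not including) the first period.
after_at = df["col"].astype(str).str.extract(r"@([^.]+)\.", expand=False)

print(after_at.tolist())  # ['example', 'mail']
```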

answered Apr 14 '21 at 17:49

Joselin Ceron


score -1 · Answer 4 · edited Jul 27 '20 at 17:46

Use a regular expression to find the year stored between parentheses. We include the parentheses in the pattern so that it does not conflict with movies that have years in their titles:

movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

Removing the parentheses:

movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

Removing the years from the 'title' column:

movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

Applying the strip function to get rid of any ending whitespace characters that may have appeared:

movies_df['title'] = movies_df['title'].str.strip()
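
The steps above can be condensed into one runnable sketch. The sample frame is made up for illustration (it includes titles that themselves contain four-digit numbers), and the capture group is placed inside the parentheses so the year comes out without them in a single extract call.

```python
import pandas as pd

# Illustrative sample, including titles that contain four-digit numbers.
movies_df = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "2001: A Space Odyssey (1968)",
        "Blade Runner 2049 (2017)",
    ]
})

# Match "(dddd)" but capture only the digits, so numbers that are part
# of the title (not wrapped in parentheses) are ignored.
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year and trim the leftover whitespace.
movies_df["title"] = (
    movies_df["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
)

print(movies_df["year"].tolist())   # ['1995', '1968', '2017']
print(movies_df["title"].tolist())  # ['Toy Story', '2001: A Space Odyssey', 'Blade Runner 2049']
```

Note that the extracted years are strings; converting them with astype(int) is a separate step if numeric values are needed.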
