str.contains not working when there is not a space between the word and special character

Question

I have a dataframe which includes the names of movie titles and TV Series.

Using specific keywords, I want to classify each row as a Movie or a Series. However, when a bracket sits directly against a keyword with no space in between, the keyword is not picked up by the str.contains() function, so I need a workaround.

This is my dataframe:

import pandas as pd
import numpy as np

watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'], 
                   ['James Bond'],
                   ['How I met your Mother (Avnsitt 3)'], 
                   ['random name'],
                   ['Random movie 3 Episode 8383893']], 
                  columns=['Title'])
watched_df.head()

To add the column that classifies the titles as TV series or Movies I have the following code.

watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '', regex=False)
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop(columns='temporary_brackets_removed_title')
watched_df.head()

Is there a simpler way to solve this without having to add and drop a column?

Maybe a str.contains-like function that does not look at a string being the exact same but just containing the given word? Similar to how in SQL you have the "Like" functionality?

Changing the pattern to `'Episode|Avnsitt'` does the job. – tlentali Nov 25 '21 at 18:32

score 2 · Accepted Answer · edited Nov 25 '21 at 19:30

You can use str.contains and then map the results:

watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})

Output:

>>> watched_df
                               Title Film_Type
0      Love Death Robots (Episode 1)    Series
1                         James Bond     Movie
2  How I met your Mother (Avnsitt 3)    Series
3                        random name     Movie
4     Random movie 3 Episode 8383893    Series
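For context on why the pattern in the question missed one of the bracketed titles: spaces inside a regular expression are literal characters, so 'Episode | Avnsitt' looks for "Episode" followed by a space or "Avnsitt" preceded by a space. In "(Avnsitt 3)" the keyword is preceded by a bracket rather than a space, so it does not match. A minimal sketch using the sample titles from the question (the variable names are only illustrative):

```python
import pandas as pd

titles = pd.Series([
    'Love Death Robots (Episode 1)',
    'James Bond',
    'How I met your Mother (Avnsitt 3)',
    'random name',
    'Random movie 3 Episode 8383893',
])

# Spaces are part of the pattern: this needs "Episode " or " Avnsitt".
with_spaces = titles.str.contains('Episode | Avnsitt', case=False)

# Without the spaces, each keyword matches wherever it appears in the title.
without_spaces = titles.str.contains('Episode|Avnsitt', case=False)

print(pd.DataFrame({
    'Title': titles,
    'with_spaces': with_spaces,
    'without_spaces': without_spaces,
}))
```

Only the "(Avnsitt 3)" row should differ between the two columns, which is the row the question's workaround (stripping the bracket first) was compensating for.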

edited Nov 25 '21 at 19:30 by Henry Ecker

answered Nov 25 '21 at 18:29

That definitely answers the question I posted. However, after testing it I realized this doesn't work when the words I am looking for are not attached to a bracket. Hence, in row 4, where the title is "Random movie 3 Episode 8383893", it will be classified as a movie. – Sebastian ten Berge Nov 25 '21 at 18:37

Updated. Try now, @Sebastien. – Nov 25 '21 at 18:40
Works perfectly! Thank you @user17242583. Do I understand the code correctly that the "r'" stands for regex and the ?: means any character can be in front of the words passed to the contains function? – Sebastian ten Berge Nov 25 '21 at 18:42

[What exactly do "u" and "r" string flags do, and what are raw string literals?](https://stackoverflow.com/q/2081640/15497888) explains the `r` prefix on the string (it is often used with regex just because it makes escaping special characters easier) and [What is a non-capturing group in regular expressions?](https://stackoverflow.com/q/3512471/15497888) covers the `?:` part (though technically neither the raw-string nor the non-capturing group is strictly necessary here since `.str.contains('Episode|Avnsitt')` also works fine in this case) @SebastiantenBerge – Henry Ecker Nov 25 '21 at 19:23

Sorry @Sebastien - I didn't see that comment! As Henry linked, the `r` prefix stands for `raw` (so you don't have to escape backslashes) and is most commonly used with regexes; the fact that `regex` also starts with `r` is merely a coincidence. I added `?:` to avoid warnings - you can read more at the links Henry provided. – Nov 25 '21 at 19:26
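
To make the point about `?:` and warnings concrete, here is a small sketch. It assumes the behaviour of the pandas versions I am aware of, where `str.contains` emits a `UserWarning` when the pattern contains a capturing group; the helper name `contains_with_warnings` is hypothetical and not part of pandas.

```python
import warnings
import pandas as pd

titles = pd.Series(['Love Death Robots (Episode 1)', 'James Bond'])

def contains_with_warnings(pattern):
    """Return (matches, number of UserWarnings emitted) for a pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        matches = titles.str.contains(pattern).tolist()
    user_warnings = [w for w in caught if issubclass(w.category, UserWarning)]
    return matches, len(user_warnings)

# A capturing group: same matches, but pandas is expected to warn about it.
capturing = contains_with_warnings(r'(Episode|Avnsitt)')

# A non-capturing group (?:...) groups the alternatives without capturing.
non_capturing = contains_with_warnings(r'(?:Episode|Avnsitt)')

# No group at all, as suggested in the comments.
no_group = contains_with_warnings('Episode|Avnsitt')

# The r prefix only changes how Python reads backslashes in the literal,
# so with no backslashes the two spellings are the same string.
same_string = r'Episode|Avnsitt' == 'Episode|Avnsitt'

print(capturing, non_capturing, no_group, same_string)
```

All three patterns should select the same rows; the difference is only whether a warning is raised, which is why the plain `'Episode|Avnsitt'` form is enough here.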
