Fastest way to check for multiple string matches in a dataframe column

Question

I am trying to find string matches in a dataframe that lists actors and the movies they acted in.

my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']

Actor	Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown	The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan	The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman	The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam	12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley	Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen	The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson	Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef	The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf	Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt,	Inception

I am currently using the following approach to do the string matching, but it takes a very long time because the dataset has almost 100K rows.

def favourite_actor(movie_dataset):
    # Flag every row whose 'Actor' string contains one of the favourite actors.
    for actor in my_favourite_actors:
        mask = movie_dataset['Actor'].str.contains(actor, case=False)
        # Assign through .loc with a boolean mask to avoid chained assignment.
        movie_dataset.loc[mask, '_IsActorFound'] = 1

Rows that contain one of my favourite actors get the value 1 in the adjacent column '_IsActorFound'.

What is an efficient way to do this string matching? My current code takes an extremely long time to execute.

Vivek Kalyanarangan · Answer 1 · 2022-07-08T11:52:29.317

0

Use:

df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)

Output

0     True
1     True
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: Actor, dtype: bool

Explanation

Build a regex on the fly by joining the list with |, then pass it to the .str.contains() accessor in pandas. The | is regex alternation, so a row is set to True if any one element of the list matches.
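As a minimal sketch of how this could be written back into the '_IsActorFound' column from the question: the three-row dataframe below is only a stand-in built from the question's sample, and escaping the names with re.escape and passing na=False are my own additions (assumptions about names with special characters and about missing values), not something the question requires.

```python
import re

import pandas as pd

my_favourite_actors = ['Clint Eastwood', 'Morgan Freeman', 'Al Pacino']

# Small stand-in for the real data (three rows copied from the question's sample).
df = pd.DataFrame({
    'Actor': [
        'Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown',
        'Henry Fonda, Lee Cobb, Martin Balsam',
        'Clint Eastwood, Eli Wallach, Lee Van Cleef',
    ],
    'Movie': [
        'The Shawshank Redemption',
        '12 Angry Men',
        'The Good, the Bad and the Ugly',
    ],
})

# Escape each name so characters such as '.' or '(' are matched literally.
pattern = '|'.join(re.escape(actor) for actor in my_favourite_actors)

# One pass over the column; na=False treats missing values as "no match".
mask = df['Actor'].str.contains(pattern, case=False, regex=True, na=False)

# Store the result as 0/1, as in the question.
df['_IsActorFound'] = mask.astype(int)
```

This replaces the per-actor loop with a single call, so the column is scanned once instead of once per actor.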

edited Jul 08 '22 at 11:52

answered Jul 08 '22 at 11:46


score 0 · Answer 2 · answered Jul 08 '22 at 11:50

You could use the apply function as follows:

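def find_actor(s, actors):
    # Compare in lower case so the match is case-insensitive.
    s = s.lower()
    for actor in actors:
        if actor in s:
            return 1
    return 0

# Lower-case the names once, up front; a list has no .lower() method.
lowered_actors = [actor.lower() for actor in my_favourite_actors]
df['Actor'].apply(find_actor, actors=lowered_actors)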

The advantage is that the check stops as soon as one of the actors is found. Note that apply is acceptable for strings here, because for the default object-dtype columns str.contains() is not truly vectorized under the hood either: it still loops over the values one by one.
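Which approach is faster depends on the data, so it may be worth checking on your own dataframe. Here is a rough sketch that confirms the short-circuiting check and the joined-regex check give the same flags; the sample rows and the helper name has_favourite are illustrative assumptions, not part of the question.

```python
import pandas as pd

my_favourite_actors = ['Clint Eastwood', 'Morgan Freeman', 'Al Pacino']

# Illustrative rows taken from the question's sample.
df = pd.DataFrame({
    'Actor': [
        'Marlon Brando, Al Pacino, James Caan',
        'Brad Pitt, Edward Norton, Meat Loaf',
        'Clint Eastwood, Eli Wallach, Lee Van Cleef',
    ],
})

# Lower-case the names once instead of on every row.
lowered = [actor.lower() for actor in my_favourite_actors]


def has_favourite(cast):
    """Return 1 if any favourite actor appears in the cast string, else 0."""
    cast = cast.lower()
    # any() stops at the first actor that matches.
    return int(any(actor in cast for actor in lowered))


via_apply = df['Actor'].apply(has_favourite)

# Same check with a single case-insensitive regex alternation.
via_regex = df['Actor'].str.contains(
    '|'.join(my_favourite_actors), case=False
).astype(int)
```

Wrapping each of the two expressions in timeit on the full 100K rows should show which one is quicker for your data.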
