
I am trying to find string matches in a dataframe that holds a list of actors and the movies they acted in.

my_favourite_actors = ['Clint Eastwood','Morgan Freeman','Al Pacino']
Actor | Movie
Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown | The Shawshank Redemption
Marlon Brando, Al Pacino, James Caan | The Godfather
Christian Bale, Heath Ledger, Aaron Eckhart, Gary Oldman, Maggie Gyllenhaal, Morgan Freeman | The Dark Knight
Henry Fonda, Lee Cobb, Martin Balsam | 12 Angry Men
Liam Neeson, Ralph Fiennes, Ben Kingsley | Schindler's List
Elijah Wood, Viggo Mortensen, Ian McKellen | The Lord of the Rings: The Return of the King
John Travolta, Uma Thurman, Samuel Jackson | Pulp Fiction
Clint Eastwood, Eli Wallach, Lee Van Cleef | The Good, the Bad and the Ugly
Brad Pitt, Edward Norton, Meat Loaf | Fight Club
Leonardo DiCaprio, Joseph Gordon-Levitt, | Inception

I am currently using the following approach to do the string matching, but it takes a very long time because the full dataset has almost 100K rows.

def favourite_actor(movie_dataset):
    # One full pass over the 'Actor' column per favourite actor.
    for actor in my_favourite_actors:
        mask = movie_dataset['Actor'].str.contains(actor, case=False)
        # A single .loc assignment; chained ["col"].iloc[...] = 1 may write to a copy.
        movie_dataset.loc[mask, '_IsActorFound'] = 1

Each row that contains one of my favourite actors gets the value 1 in its adjacent '_IsActorFound' column.

What is a fast way to do this string matching, given that my current code takes an extremely long time to execute?

Agnis

2 Answers


Use:

df['Actor'].str.contains('|'.join(my_favourite_actors), regex=True, case=False)

Output

0     True
1     True
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: Actor, dtype: bool

Explanation

Build the regex on the fly by joining the list with |, then pass it to the .str.contains() accessor in pandas. The | is regex alternation, so a row is True if any one name in the list matches.
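
As a minimal sketch of how this could fill the question's '_IsActorFound' column, the snippet below builds the pattern and stores the result as 0/1. The sample rows come from the question's table; the use of re.escape and na=False is an added assumption (names may contain regex metacharacters, and the real data may have missing values), not part of the original answer.

```python
import re

import pandas as pd

my_favourite_actors = ['Clint Eastwood', 'Morgan Freeman', 'Al Pacino']

# Small sample taken from the question's table.
df = pd.DataFrame({
    'Actor': [
        'Morgan Freeman, Tim Robbins, Bob Gunton, William Sadler, Clancy Brown',
        'Marlon Brando, Al Pacino, James Caan',
        'Henry Fonda, Lee Cobb, Martin Balsam',
        'Clint Eastwood, Eli Wallach, Lee Van Cleef',
    ],
    'Movie': [
        'The Shawshank Redemption',
        'The Godfather',
        '12 Angry Men',
        'The Good, the Bad and the Ugly',
    ],
})

# Escape each name so characters such as '.' or '(' are matched literally.
pattern = '|'.join(re.escape(name) for name in my_favourite_actors)

# na=False treats a missing actor string as "no match" (assumption about the data).
mask = df['Actor'].str.contains(pattern, case=False, regex=True, na=False)

# Store the boolean result as 1/0 in the column the question uses.
df['_IsActorFound'] = mask.astype(int)
```

This scans the column once for all names, instead of once per favourite actor as in the question's loop.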

Vivek Kalyanarangan

You could use the apply function as follows:

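def find_actor(s, actors):
    # Lowercase the cell once, then stop at the first actor that matches.
    s = s.lower()
    for actor in actors:
        if actor in s:
            return 1

    return 0

# The actor names must be lowercased too; a list has no .lower() method.
lowered_actors = [actor.lower() for actor in my_favourite_actors]
df['Actor'].apply(find_actor, actors=lowered_actors)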

The advantage is that it stops checking a row as soon as one of the actors is found. Note that apply is acceptable for strings here, because str.contains() on an object-dtype column also processes the values one at a time under the hood rather than being truly vectorized.
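
Which approach is faster depends on the data, so it may be worth measuring on your own column. The following is a rough sketch, not a benchmark: the synthetic column (two rows from the question repeated 5,000 times) and the names actor_col, lowered, and pattern are assumptions made for illustration.

```python
import re
import timeit

import pandas as pd

my_favourite_actors = ['Clint Eastwood', 'Morgan Freeman', 'Al Pacino']
lowered = [a.lower() for a in my_favourite_actors]

# Synthetic data (assumption): two rows from the question, repeated.
actor_col = pd.Series([
    'Marlon Brando, Al Pacino, James Caan',
    'Henry Fonda, Lee Cobb, Martin Balsam',
] * 5000, name='Actor')


def find_actor(s, actors):
    # Plain substring search that stops at the first matching actor.
    s = s.lower()
    return int(any(actor in s for actor in actors))


pattern = '|'.join(re.escape(a) for a in my_favourite_actors)

via_apply = actor_col.apply(find_actor, actors=lowered).astype('int64')
via_regex = actor_col.str.contains(pattern, case=False, regex=True).astype('int64')

# Both approaches should flag exactly the same rows.
same_result = bool((via_apply == via_regex).all())

# Time three runs of each approach; absolute numbers vary by machine.
t_apply = timeit.timeit(lambda: actor_col.apply(find_actor, actors=lowered), number=3)
t_regex = timeit.timeit(
    lambda: actor_col.str.contains(pattern, case=False, regex=True), number=3
)
print(f'apply: {t_apply:.3f}s  regex: {t_regex:.3f}s  same result: {same_result}')
```

Timings on repeated synthetic rows will not match a real 100K-row dataset, so treat the printed numbers only as a relative comparison.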

vogelstein