Extract all matching keywords from a list of words and create a new dataframe pandas

Question

I would like to extract every word in the opinions column that matches a word in the keywords list and put all the matching words (including repeated words) in a new column. The current code only extracts the first matching word and doesn't include repeated words.

import pandas as pd

df = pd.DataFrame({
    'opinions':[
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.extract( '({})'.format(query) )

print(df)

current output

Dani Mesejo · Accepted Answer · 2020-11-07T14:24:11.243

If you only want to match full words, you need to use word boundary markers; otherwise, prefixes (and suffixes) will be matched as well. For example:

import pandas as pd

df = pd.DataFrame({
    'opinions':[
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.findall(r'\b({})\b'.format(query))

print(df)

Output

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []

In the above example, greatness was not matched because of the word boundaries (\b).
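To check the two behaviours the question asks about (repeated words are kept, partial words are skipped), here is a minimal sketch using the standard re module, which str.findall applies to each element. The find_keywords helper and the sample sentences are made up for illustration, and re.escape is an added precaution that the original code does not use.

```python
import re

keywords = ['movie', 'great', 'fantastic', 'loved']

# re.escape is a precaution (not in the original answer): it keeps any
# regex metacharacters that a keyword might contain literal.
pattern = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, keywords))))

def find_keywords(text):
    """Hypothetical helper: return every whole-word keyword match, repeats included."""
    return pattern.findall(text)

# Repeated keywords appear once per occurrence, in order of appearance.
repeated = find_keywords("The movie was fantastic, truly fantastic; loved the movie")
# 'greatness' only contains 'great' as a prefix, so \b rejects it.
partial = find_keywords("He has greatness within")

print(repeated)  # ['movie', 'fantastic', 'fantastic', 'loved', 'movie']
print(partial)   # []
```

Note that the match is case-sensitive as written; passing flags=re.IGNORECASE to re.compile would be one way to relax that if needed.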

A note on performance

As a side note, if you are looking for an efficient solution for large data, union regexes are not a good approach (see here). I suggest you use a library such as trrex.

import pandas as pd
import trrex as tx

df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']
query = tx.make(keywords, left=r"\b(", right=r")\b")

df['word'] = df['opinions'].str.findall(r'{}'.format(query))

print(df)

Output (using trrex)

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []

For a comparison on performance see the image below:

For a set of 25K words, trrex is 300 times faster than the union regex. The experiments from the image above can be reproduced with the following gist.

DISCLAIMER: I'm the author of trrex

score 0 · Answer 2 · answered Nov 07 '20 at 11:23

You should replace extract with findall:

Find all occurrences of pattern or regular expression in the Series/Index.

Equivalent to applying re.findall() to all the elements in the Series/Index.

print(df)
                                                opinions                word
    0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
    1                              How did they make it?                  []
    2   I had a fantastic time at the cinema last night!         [fantastic]
    3                         I really disliked the cast                  []
    4                        the film was sad and boring                  []
    5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
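As a rough sketch of that replacement, the snippet below runs both methods on two of the question's rows so the difference is visible side by side. The first_only name is only for illustration; the pattern is the one from the question, without word boundaries.

```python
import pandas as pd

df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']
query = '|'.join(keywords)

# str.extract keeps only the first match per row (NaN when nothing matches).
first_only = df['opinions'].str.extract('({})'.format(query))[0]

# str.findall returns a list with every match per row (empty list when nothing matches).
df['word'] = df['opinions'].str.findall('({})'.format(query))

print(first_only.tolist())
print(df['word'].tolist())
```

With extract, the first row yields only movie; with findall, it yields both movie and fantastic.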
