If you only want to match whole words, you need to use word boundary markers; otherwise keywords that appear as prefixes (or suffixes) of longer words will be matched too. For example:
import pandas as pd
df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})
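keywords = ['movie', 'great', 'fantastic', 'loved']
# Join the keywords into a single alternation: movie|great|fantastic|loved
query = '|'.join(keywords)
# \b on both sides restricts the matches to whole words
df['word'] = df['opinions'].str.findall(r'\b({})\b'.format(query))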
print(df)
Output
opinions word
0 I think the movie is fantastic. Shame it's so ... [movie, fantastic]
1 How did they make it? []
2 I had a fantastic time at the cinema last night! [fantastic]
3 I really disliked the cast []
4 the film was sad and boring []
5 Absolutely loved the movie! Can't wait to see ... [loved, movie]
6 He has greatness within []
In the above example, greatness was not matched because of the word boundaries (\b).
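To see the effect of the boundaries in isolation, here is a minimal sketch that uses only the standard re module instead of pandas. The re.escape call and the non-capturing group are my additions for illustration; they are not needed for these four keywords.

```python
import re

keywords = ['movie', 'great', 'fantastic', 'loved']
# re.escape is a precaution for keywords that contain regex metacharacters
union = '|'.join(re.escape(k) for k in keywords)

plain = re.compile(union)                           # no word boundaries
bounded = re.compile(r'\b(?:{})\b'.format(union))   # whole words only

text = "He has greatness within"
print(plain.findall(text))    # ['great']
print(bounded.findall(text))  # []
```

Without \b the keyword great matches inside greatness; with \b on both sides it does not.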
A note on performance
As a side note, if you are looking for an efficient solution for large data, union regexes are not a good approach (see here). I suggest you use a library such as trrex.
import pandas as pd
import trrex as tx
df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})
keywords = ['movie', 'great', 'fantastic', 'loved']
# left and right wrap the generated pattern in word boundaries and a capture group
query = tx.make(keywords, left=r"\b(", right=r")\b")
df['word'] = df['opinions'].str.findall(r'{}'.format(query))
print(df)
Output (using trrex)
opinions word
0 I think the movie is fantastic. Shame it's so ... [movie, fantastic]
1 How did they make it? []
2 I had a fantastic time at the cinema last night! [fantastic]
3 I really disliked the cast []
4 the film was sad and boring []
5 Absolutely loved the movie! Can't wait to see ... [loved, movie]
6 He has greatness within []
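For intuition about why a plain union scales poorly, here is a hand-rolled sketch of a prefix-sharing (trie-style) pattern, which is one common way to make a large alternation cheaper. The helper name trie_pattern is hypothetical, this is not trrex's code, and the pattern it produces is not claimed to match what trrex generates.

```python
import re

def trie_pattern(words):
    """Build a regex in which words with a common prefix share that prefix."""
    # Nested-dict trie; the key '' marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}

    def to_regex(node):
        # Each branch is an escaped character followed by its subtree's regex.
        branches = [re.escape(ch) + to_regex(child)
                    for ch, child in sorted(node.items()) if ch != '']
        if not branches:
            return ''
        ends_here = '' in node
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = '(?:' + '|'.join(branches) + ')'
        # If a word ends at this node, the rest is optional.
        return group + '?' if ends_here else group

    return to_regex(trie)

words = ['movie', 'great', 'fantastic', 'loved', 'love']
pattern = re.compile(r'\b(?:' + trie_pattern(words) + r')\b')
print(pattern.findall("Absolutely loved the movie!"))  # ['loved', 'movie']
```

Here love and loved collapse into love(?:d)?, so the engine does not retry the shared prefix for every alternative. With thousands of keywords, that sharing is what keeps matching from degrading the way a flat a|b|c|... union does.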
For a comparison on performance see the image below:

For a set of 25K words, trrex is 300 times faster than the union regex. The experiments from the image above can be reproduced with the following gist.
DISCLAIMER: I'm the author of trrex