I am working on an NLP project for work, and I'm struggling with this current part. I have a dataframe that contains requirements that are strings ('This system shall...'). We want to check every requirement against a list of words, subset the requirements that contain one or more of those words, and then add a column that contains just those words that were found in each requirement.
Requirement | Contained_Words |
---|---|
'This system shall...' | 'will','actions' |
The current problem I'm having is that it's matching the pattern anywhere in the text rather than the exact word (for example, 'will' would also match inside 'willing'), so the output is incorrect.
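
    import pandas as pd

    def bad_words(doc: pd.DataFrame):
        # Regex alternation; matches substrings, so 'will' also hits 'willing'
        words = 'will|must|actions'
        # Boolean mask: True where the requirement contains any of the patterns
        mask = doc['Requirement'].str.contains(words)
        if mask.any():
            df = doc[mask]
            print(df)
        else:
            print(f"No requirements contain the word(s): {words}.")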
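
For reference, here is a minimal sketch of the behaviour I'm after, assuming that wrapping the alternation in `\b` word boundaries is an acceptable definition of "exact word". The function name `find_bad_words`, its `words` parameter, the case-insensitive matching, and the sample data are all made up for illustration and are not part of my real code.

```python
import re
import pandas as pd

def find_bad_words(doc: pd.DataFrame, words: list) -> pd.DataFrame:
    # \b anchors the alternation at word boundaries, so 'will' does not match 'willing'
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    # For each requirement, collect every whole-word match as a list
    found = doc["Requirement"].str.findall(pattern, flags=re.IGNORECASE)
    # Keep only the requirements with at least one match
    mask = found.str.len() > 0
    result = doc.loc[mask].copy()
    result["Contained_Words"] = found[mask]
    return result

# Hypothetical sample data
demo = pd.DataFrame({"Requirement": [
    "This system shall log all actions and will alert the operator.",
    "The operator is willing to retry.",
    "The system must respond.",
]})
out = find_bad_words(demo, ["will", "must", "actions"])
print(out)
```

In this sketch the matches are listed in the order they appear in the text, and a word that occurs twice is listed twice, so some de-duplication might still be needed on top of this.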