I have a column called 'text' in my dataframe that contains free-form text. I am trying to check whether this column contains any of the strings from a list of patterns (e.g. pattern1, pattern2, pattern3), and I want to create another boolean column stating whether any of those patterns was found.
An important requirement is that the match should tolerate minor typos. For example, if my list of patterns contains 'mickey' and 'mouse', I want it to match 'm0use' and 'muckey' too, not only the exact pattern strings.
I tried this, using the regex library:
import regex
list_of_patterns = ['pattern1','pattern2','pattern3','pattern4']
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pattern=('^(' + '|'.join(list_of_patterns) + ').${e<=2:[a-zA-Z]}'),string=x,flags=regex.IGNORECASE))
I checked the text afterwards and could see that this is not working. Does anyone have a better idea for solving this problem?
Here is a short example:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'text': ['my name is mickey mouse',
                            'my name is donkey kong',
                            'my name is mockey',
                            'my surname is m0use',
                            'hey, its me, mario!']})
list_of_patterns = ['mickey','mouse']
df['contains_pattern'] = df['text'].apply(lambda x: regex.search(pattern=r'(?i)^('+ '|'.join(list_of_patterns) +'){s<=2:[a-zA-Z]}',string=x))
And here is the resulting df:
id  text                     contains_pattern
1   my name is mickey mouse  None
2   my name is donkey kong   None
3   my name is mockey        None
4   my surname is m0use      None
5   hey, its me, mario!      None
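To make the expected result concrete, here is a minimal plain-Python sketch of the behaviour I am after. It is not the regex-based solution I am hoping for: it assumes that a typo means at most two substituted characters inside a whole word (mirroring the `{s<=2}` in my attempt), it ignores inserted or deleted characters, and the helper names `within_substitutions` and `contains_any_pattern` are hypothetical.

```python
import re


def within_substitutions(word, pattern, max_subs=2):
    """Return True if word equals pattern up to max_subs substituted characters."""
    # Only substitutions are tolerated, so the lengths must be equal.
    if len(word) != len(pattern):
        return False
    mismatches = sum(a != b for a, b in zip(word.lower(), pattern.lower()))
    return mismatches <= max_subs


def contains_any_pattern(text, patterns, max_subs=2):
    """Return True if any word in text is close enough to any pattern."""
    # Keep digits inside tokens so that 'm0use' stays a single word.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return any(
        within_substitutions(token, pattern, max_subs)
        for token in tokens
        for pattern in patterns
    )


texts = [
    'my name is mickey mouse',
    'my name is donkey kong',
    'my name is mockey',
    'my surname is m0use',
    'hey, its me, mario!',
]
list_of_patterns = ['mickey', 'mouse']

expected = [contains_any_pattern(t, list_of_patterns) for t in texts]
print(expected)  # [True, False, True, True, False]
```

On the dataframe above, the same function could be applied with df['text'].apply(lambda x: contains_any_pattern(x, list_of_patterns)), which is the boolean column I would like to get from a regex-based approach.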