How do I check if a text column in my dataframe contains any of a list of possible patterns, allowing for typos?

Question

I have a column called 'text' in my dataframe that holds free-form text. I want to check whether this column contains any of the strings from a list of patterns (e.g. pattern1, pattern2, pattern3), and create a new boolean column stating whether any of those patterns was found.

It is important, though, that a pattern still matches when there are small typos. For example, if my list of patterns contains 'mickey' and 'mouse', I want it to match 'm0use' and 'muckey' too, not only the exact pattern string.

I tried this, using the regex library:

import regex
list_of_patterns = ['pattern1','pattern2','pattern3','pattern4']
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pattern=('^(' + '|'.join(list_of_patterns) + ').${e<=2:[a-zA-Z]}'),string=x,flags=re.IGNORECASE))

I checked the text afterwards and could see that this is not working. Does anyone have a better idea for solving this problem?

Here is a short example:

df = pd.DataFrame({'id':[1,2,3,4,5],
                      'text':['my name is mickey mouse',
                              'my name is donkey kong',
                              'my name is mockey',
                              'my surname is m0use',
                              'hey, its me, mario!'
                             ]})

list_of_patterns = ['mickey','mouse']    
df['contains_pattern'] = df['text'].apply(lambda x: regex.search(pattern=r'(?i)^('+ '|'.join(list_of_patterns) +'){s<=2:[a-zA-Z]}',string=x))

And here is the resulting df:

id                       text      contains_pattern
1     my name is mickey mouse                  None
2      my name is donkey kong                  None
3           my name is mockey                  None
4         my surname is m0use                  None
5           hey,its me, mario                  None

Try `r'(?i)(' + '|'.join(list_of_patterns) + '){s<=1}'`, remove `flags=re.IGNORECASE`. Well, the `{e<=2}` quantifier may be kept if it does what you need. — Wiktor Stribiżew, Jan 02 '20 at 22:47
Is the _list\_of\_patterns_ created dynamically ? Patterns imply a regular expression. It is important to state if they are exclusively literal or not. If it is just literal, it can be factored via [ternary / trie](http://www.regexformat.com/version7_files/Rx5_ScrnSht01.jpg) for better performance — , Jan 02 '20 at 23:25
Yes, list_of_patterns is created dynamically. To test, though, I am using some fixed words that I am sure appear in some rows of df['text']. I tried what Wiktor Stribiżew proposed, but it still doesn't work: it does not recognize these fixed words in any row. What do you mean by exclusively literal or not? – Mariane Reis, Jan 03 '20 at 18:33

Wiktor Stribiżew · Accepted Answer · 2020-01-03T20:41:07.067

You can fix the code by using something like

df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(r'(?i)\b(?:' + '|'.join(list_of_patterns) + r'){e<=2}\b', x))

Or, if the search words may contain special characters, use

pat = r'(?i)(?<!\w)(?:' + '|'.join([regex.escape(p) for p in list_of_patterns]) + r'){e<=2}(?!\w)'
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pat, x))

The pattern will now look like `(?i)\b(?:mickey|mouse){e<=2}\b`, where `{e<=2}` allows up to two errors (insertions, deletions or substitutions). Adjust as you see fit, but make sure that the quantifier comes right after the group.
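As a minimal sketch, assuming the sample dataframe from the question, the escaped pattern can be compiled once and applied to produce the boolean column the question asks for. The `is not None` conversion and the names `alternation` and `pat` are illustrative additions, and `regex.escape` is used here so that only the `regex` package is needed:

```python
import pandas as pd
import regex  # third-party package: pip install regex

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': [
        'my name is mickey mouse',
        'my name is donkey kong',
        'my name is mockey',
        'my surname is m0use',
        'hey, its me, mario!',
    ],
})

list_of_patterns = ['mickey', 'mouse']

# Escape each term so special characters are treated literally.
alternation = '|'.join(regex.escape(p) for p in list_of_patterns)

# {e<=2} must directly follow the group it applies to.
pat = regex.compile(r'(?i)(?<!\w)(?:' + alternation + r'){e<=2}(?!\w)')

# search() returns a match object or None; convert that to a boolean.
df['contains_pattern'] = df['text'].apply(lambda s: pat.search(s) is not None)

print(df)
```

With this data, the rows containing 'mickey mouse', 'mockey' and 'm0use' come out as True and the other two as False. A budget of two errors is fairly generous for five- or six-letter words, so it may be worth lowering it to `{e<=1}` if false positives show up on real text.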

`re.IGNORECASE` comes from the `re` package; with the `regex` library you can simply use the inline modifier `(?i)` to enable case-insensitive matching.

If you need to handle hundreds or thousands of search terms, you may leverage the approach described in "Speed up millions of regex replacements in Python 3".

edited Jan 03 '20 at 20:41

answered Jan 02 '20 at 23:26

Wiktor Stribiżew


I tried changing the regex to what you proposed, but I still cannot match a pattern that I am sure is contained in some of the lines of df['text']. – Mariane Reis Jan 03 '20 at 20:18
@MarianeReis Please provide a string and expected match. Note that to match a whole word, you need to use word boundaries, not anchors, that is, `r'(?i)\b(' + '|'.join(list_of_patterns) + r'){e<=2}\b'` – Wiktor Stribiżew Jan 03 '20 at 20:18
I made an update to the question to give more information – Mariane Reis Jan 03 '20 at 21:57
Actually, it works with r'(?i)\b(?:' now! Thank you very much! – Mariane Reis Jan 03 '20 at 22:00
@MarianeReis Ok, but note that my second snippet in the answer is more generic. The most generic is the combination of that + the linked solution at the bottom of the answer. – Wiktor Stribiżew Jan 03 '20 at 22:28
