Match star * character at end of word boundary \b

Question

In building a lightweight tool that detects censored profanity usage, I noticed that detecting special characters at the end of a word boundary is quite difficult.

Using a tuple of strings, I build an OR'd word boundary regular expression:

import re

PHRASES = (
    'sh\\*t',  # easy
    'sh\\*\\*',  # difficult
    'f\\*\\*k',  # easy
    'f\\*\\*\\*',  # difficult
)

MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES), 
    flags=re.IGNORECASE | re.UNICODE)

The problem is that the * is not something that can be detected next to a word boundary \b.

print(MATCHER.search('Well f*** you!'))  # Fail - Does not find f***
print(MATCHER.search('Well f***!'))  # Fail - Does not find f***
print(MATCHER.search('f***'))  # Fail - Does not find f***
print(MATCHER.search('f*** this!'))  # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***'))  # Pass - Should not match
print(MATCHER.search('f**k this!'))  # Pass - Should find

Any ideas for setting this up in a convenient way to support phrases that end in special characters?

Do you mean the problem that the `\b` is between the `f` and `*` because `*` is not a word character but `f` is? — Yunnosch, Oct 12 '19 at 17:39
@Yunnosch exactly the problem. I'm maybe looking for a `\b` alternative that supports special characters at the boundary. — tester, Oct 12 '19 at 17:40
Please make a long list of examples, some which should match and some which should not. Also show the regex you used successfully for the "easy" matches and the one you tried unsuccessfully for the difficult matches. — Yunnosch, Oct 12 '19 at 17:41
How about making four lists of phrases, "easy", "start nonword", "end nonword", "startend nonword". Then make four corresponding matchers, which expect "\bs\b", "[^\s]s\b", "\bs[\s$]" and "[^\s]s[\s$]" around. — Yunnosch, Oct 12 '19 at 17:49

score 5 · Accepted Answer · answered Oct 28 '19 at 09:48

The * is not a word character, so a \b placed after it cannot match when another non-word character (or the end of the string) follows.
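To make that concrete, here is a small sketch with Python's `re` module. The name `TRAILING_B` and the result variables are just for illustration.

```python
import re

# \b only matches where exactly one neighbour is a word character.
TRAILING_B = re.compile(r"f\*\*\*\b")

no_match_punct = TRAILING_B.search("f***!")  # '*' then '!': both non-word
no_match_end = TRAILING_B.search("f***")     # '*' then end of string
match_word = TRAILING_B.search("f***a")      # '*' then 'a': a real boundary

print(no_match_punct, no_match_end, match_word)
```

So the trailing \b does the opposite of what is wanted here: it rejects `f***!` and accepts `f***a`.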

Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead?

\b(...)(?![\w*])

See this demo at regex101

If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
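Put together in Python, this could look like the sketch below, run against the test strings from the question. The `find` helper is a hypothetical name added for illustration, and `re.UNICODE` is left out because it is already the default for `str` patterns in Python 3.

```python
import re

PHRASES = (r"sh\*t", r"sh\*\*", r"f\*\*k", r"f\*\*\*")

# Lookarounds act as a custom boundary: no word character or '*'
# may sit directly before or after the phrase.
MATCHER = re.compile(
    r"(?<![\w*])(?:%s)(?![\w*])" % "|".join(PHRASES),
    flags=re.IGNORECASE,
)


def find(text):
    """Return the first matched phrase in text, or None."""
    m = MATCHER.search(text)
    return m.group(0) if m else None


for sample in ("Well f*** you!", "Well f***!", "f***", "f*** this!",
               "secret code is 123f***", "f**k this!", "sh*t*", "f***a"):
    print(repr(sample), "->", find(sample))
```

Because `*` is part of both character classes, `sh*t*` and `f***a` are rejected, while `f***!` and a bare `f***` are found.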

bobble bubble

This is like the holy grail of regular expressions. Thank you! `re.compile(r"(?<![\w*])(%s)(?![\w*])" % "|".join(PROFANE_PHRASES), flags=re.IGNORECASE | re.UNICODE)` – tester Oct 30 '19 at 00:32

score 1 · Answer 2 · answered Oct 12 '19 at 17:59

Use your knowledge of the starts and endings of the phrases and use them with corresponding matchers.
Here is a static version, but it is easy to sort incoming new phrases automatically according to the start and ending.

import re

PHRASES1 = (
    'sh\\*t',  # easy
    'f\\*\\*k',  # easy
)
PHRASES2 = (
    'sh\\*\\*',  # difficult
    'f\\*\\*\\*',  # difficult
)
PHRASES3 = (
    '\\*\\*\\*hole', 
)
PHRASES4 = (
    '\\*\\*\\*sonofa\\*\\*\\*\\*\\*',  # non-word characters at both ends
)
MATCHER1 = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES1), 
    flags=re.IGNORECASE | re.UNICODE)
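# Inside a character class, ^ and $ are literal characters, so the
# start/end-of-string cases are written as alternations instead.
MATCHER2 = re.compile(
    r"\b(%s)(?:\s|$)" % "|".join(PHRASES2),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER3 = re.compile(
    r"(?:^|\s)(%s)\b" % "|".join(PHRASES3),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER4 = re.compile(
    r"(?:^|\s)(%s)(?:\s|$)" % "|".join(PHRASES4),
    flags=re.IGNORECASE | re.UNICODE)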
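The automatic sorting could be sketched roughly as follows. This is only one possible way to do it: `wrap`, `PLAIN_PHRASES` and `AUTO_MATCHER` are hypothetical names, the phrases are assumed to be plain text (escaped with `re.escape`), and the lookarounds `(?<!\S)` / `(?!\S)` are swapped in for the "whitespace or string edge" checks so that no surrounding whitespace is consumed.

```python
import re


def wrap(phrase):
    """Pick a boundary for each end from the phrase's first/last character."""
    # Word character at the edge: a normal \b works.
    # Non-word character at the edge: require whitespace or the string edge.
    start = r"\b" if re.fullmatch(r"\w", phrase[0]) else r"(?<!\S)"
    end = r"\b" if re.fullmatch(r"\w", phrase[-1]) else r"(?!\S)"
    return start + re.escape(phrase) + end


PLAIN_PHRASES = ("sh*t", "sh**", "f**k", "f***", "***hole")

AUTO_MATCHER = re.compile(
    "|".join(wrap(p) for p in PLAIN_PHRASES),
    flags=re.IGNORECASE,
)

for sample in ("Well f*** you!", "you ***hole", "secret code is 123f***",
               "Well f***!"):
    print(repr(sample), "->", AUTO_MATCHER.search(sample))
```

Note that requiring whitespace after a trailing `*` means `Well f***!` is not found with this approach, since `!` directly follows the phrase.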

score 0 · Answer 3 · answered Oct 12 '19 at 17:54

Could embed the boundary requirements in each string like

'\\bsh\\*t\\b', 
'\\bsh\\*\\*',  
'\\bf\\*\\*k\\b',  
'\\bf\\*\\*\\*',

then r"(%s)" % "|".join(PHRASES)
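As a quick check of the embedded-boundary variant in Python, here is a sketch (the names `EMBEDDED` and `EMBEDDED_MATCHER` are made up for this example):

```python
import re

# Each phrase carries its own boundaries; phrases ending in '*'
# simply have no trailing \b.
EMBEDDED = (
    r"\bsh\*t\b",
    r"\bsh\*\*",
    r"\bf\*\*k\b",
    r"\bf\*\*\*",
)

EMBEDDED_MATCHER = re.compile(r"(%s)" % "|".join(EMBEDDED), flags=re.IGNORECASE)

for sample in ("Well f*** you!", "f**k this!", "secret code is 123f***", "f***a"):
    print(repr(sample), "->", EMBEDDED_MATCHER.search(sample))
```

The leading \b still rejects `123f***`, but with no check at the end, `f***a` is matched as `f***`.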

Or, if the regex engine supports conditionals, it's done like this

'sh\\*t', 
'sh\\*\\*',  
'f\\*\\*k',  
'f\\*\\*\\*',

then r"(?(?=\w)\b)(%s)(?(?<=\w)\b)" % "|".join(PHRASES)
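For engines without lookaround conditionals (Python's `re` only accepts a group name or number as the condition), a conditional with an empty "else" branch can be spelled out as an alternation. The following is a sketch of that rewrite, not a feature of the engine; `CONDITIONAL_FREE` is a hypothetical name.

```python
import re

PHRASES = (r"sh\*t", r"sh\*\*", r"f\*\*k", r"f\*\*\*")

# (?(?=\w)\b)   becomes  (?:(?=\w)\b|(?!\w))
# (?(?<=\w)\b)  becomes  (?:(?<=\w)\b|(?<!\w))
CONDITIONAL_FREE = re.compile(
    r"(?:(?=\w)\b|(?!\w))(%s)(?:(?<=\w)\b|(?<!\w))" % "|".join(PHRASES),
    flags=re.IGNORECASE,
)

for sample in ("Well f*** you!", "f**k this!", "secret code is 123f***",
               "f**ks", "f***a"):
    print(repr(sample), "->", CONDITIONAL_FREE.search(sample))
```

A boundary is only enforced on a side where the phrase has a word character, so `f**ks` is rejected but `f***a` still matches as `f***`.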

Interesting idea with the embeddings. Python 3 doesn't seem to support that last bit: `re.compile(r"(?(?=\w)\b)(%s)(?(?<=\w)\b)")` -> `bad character in group name '?=\\w' at position 3` – tester Oct 12 '19 at 18:53

@tester - For sure there are choices that can be made. A word boundary is nothing more than a pair of lookarounds at the current position. If there is a mix of text words like in the sample given, no boundaries can be generalized without a conditional. Otherwise, it has to be a homegrown boundary set on each sample item. No amount of bounty offered will change that. – Oct 28 '19 at 15:48

score 0 · Answer 4 · answered Oct 29 '19 at 19:46

I don't fully understand your statement that * is not something that can be found next to a word boundary. However, if I understand what you are looking for correctly from the comments, I think this would work:

\b[\w]\*+[\w]*

Word boundary
Followed by some letter, like f
Followed by one or many *
Optionally ending in some letter, like k
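Tried out in Python, the pattern behaves as in the sketch below (`GENERIC` and `found` are illustrative names). It is a generic shape rather than a phrase list, so it is worth checking what else it picks up.

```python
import re

# One word character, one or more '*', then optional word characters.
GENERIC = re.compile(r"\b[\w]\*+[\w]*")


def found(text):
    """Return the first match in text, or None."""
    m = GENERIC.search(text)
    return m.group(0) if m else None


for sample in ("Well f*** you!", "f**k this!", "secret code is 123f***",
               "sh*t", "2*3"):
    print(repr(sample), "->", found(sample))
```

Since only a single character is allowed before the stars, `sh*t` is not found, while an expression like `2*3` is.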

Example:

https://regexr.com/4nqie
