
In building a lightweight tool that detects censored profanity usage, I noticed that detecting special characters at the end of a word boundary is quite difficult.

Using a tuple of strings, I build an OR'd word-boundary regular expression:

import re

PHRASES = (
    'sh\\*t',  # easy
    'sh\\*\\*',  # difficult
    'f\\*\\*k',  # easy
    'f\\*\\*\\*',  # difficult
)

MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES), 
    flags=re.IGNORECASE | re.UNICODE)

The problem is that the * is not something that can be detected next to a word boundary \b.

print(MATCHER.search('Well f*** you!'))  # Fail - Does not find f***
print(MATCHER.search('Well f***!'))  # Fail - Does not find f***
print(MATCHER.search('f***'))  # Fail - Does not find f***
print(MATCHER.search('f*** this!'))  # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***'))  # Pass - Should not match
print(MATCHER.search('f**k this!'))  # Pass - Should find 

Any ideas for setting this up in a convenient way to support phrases that end in special characters?

tester
  • Do you mean the problem that the `\b` is between the `f` and `*` because `*` is not a word character but `f` is? – Yunnosch Oct 12 '19 at 17:39
  • @Yunnosch exactly the problem. I'm maybe looking for a `\b` alternative that supports special characters at the boundary. – tester Oct 12 '19 at 17:40
  • Please make a long list of examples, some which should match and some which should not. Also show the regex you used successfully for the "easy" matches and the one you used unsuccessfully for the difficult matches. – Yunnosch Oct 12 '19 at 17:41
  • How about making four lists of phrases, "easy", "start nonword", "end nonword", "startend nonword". Then make four corresponding matchers, which expect "\bs\b", "[^\s]s\b", "\bs[\s$]" and "[^\s]s[\s$]" around. – Yunnosch Oct 12 '19 at 17:49

4 Answers

5

The * is not a word character, so there is no match when it is followed by \b and then a non-word character (or the end of the string).
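To make that concrete, here is a minimal sketch of where \b does and does not match after a trailing *. The helper name `ends_at_boundary` is made up for illustration and is not part of the question's code.

```python
import re

def ends_at_boundary(escaped_phrase, text):
    """True if the phrase is found in text with a \\b directly after it."""
    return re.search(escaped_phrase + r"\b", text) is not None

# \b only matches between a word character (\w) and a non-word character,
# or at a string edge that touches a word character. '*' is a non-word character.
print(ends_at_boundary(r"f\*\*\*", "f*** you"))   # '*' then ' ': no boundary
print(ends_at_boundary(r"f\*\*\*", "f***"))       # '*' then end of string: no boundary
print(ends_at_boundary(r"f\*\*\*", "f***a"))      # '*' then 'a': boundary
print(ends_at_boundary(r"f\*\*k", "f**k this"))   # 'k' then ' ': boundary
```

So with a trailing \b, a phrase ending in * can only match when a word character follows it, which is the opposite of what is wanted here.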

Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead?

\b(...)(?![\w*])

See this demo at regex101

If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
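As a rough sketch, both lookarounds can be combined and run against the test strings from the question. The helper `find_phrase` is a name invented here, and the phrases are written as plain text and passed through `re.escape` instead of being escaped by hand; both are assumptions rather than part of the original code.

```python
import re

# Plain-text phrases; re.escape takes care of the '*' characters.
PHRASES = ("sh*t", "sh**", "f**k", "f***")

# (?<![\w*]) : not preceded by a word character or '*'
# (?![\w*])  : not followed by a word character or '*'
MATCHER = re.compile(
    r"(?<![\w*])(?:%s)(?![\w*])" % "|".join(re.escape(p) for p in PHRASES),
    flags=re.IGNORECASE)

def find_phrase(text):
    """Return the first matched phrase in text, or None (hypothetical helper)."""
    match = MATCHER.search(text)
    return match.group(0) if match else None

for sample in ("Well f*** you!", "Well f***!", "f***", "f*** this!",
               "secret code is 123f***", "f**k this!", "sh*t*", "f***a"):
    print(repr(sample), "->", find_phrase(sample))
```

Treating * as part of the "word" on both sides is a design choice: it keeps sh*t from matching inside sh*t*, at the cost of also rejecting a phrase that is immediately followed by an unrelated asterisk.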

bobble bubble
  • This is like the holy grail of regular expressions. Thank you! `re.compile(r"(?<![\w*])(%s)(?![\w*])" % "|".join(PROFANE_PHRASES), flags=re.IGNORECASE | re.UNICODE)` – tester Oct 30 '19 at 00:32
1

Use your knowledge of the starts and endings of the phrases and use them with corresponding matchers.
Here is a static version, but it is easy to sort incoming new phrases automatically according to the start and ending.

import re

PHRASES1 = (
    'sh\\*t',  # easy
    'f\\*\\*k',  # easy
)
PHRASES2 = (
    'sh\\*\\*',  # difficult
    'f\\*\\*\\*',  # difficult
)
PHRASES3 = (
    '\\*\\*\\*hole', 
)
PHRASES4 = (
    '\\*\\*\\*sonofa\\*\\*\\*\\*\\*',  # non-word characters at both ends
)
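# Word characters at both ends: plain \b works.
MATCHER1 = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES1),
    flags=re.IGNORECASE | re.UNICODE)
# (?!\S): followed by whitespace or end of string.
# Inside [...], "$" would be a literal dollar sign, not an anchor.
MATCHER2 = re.compile(
    r"\b(%s)(?!\S)" % "|".join(PHRASES2),
    flags=re.IGNORECASE | re.UNICODE)
# (?<!\S): preceded by whitespace or start of string.
# Inside [...], "^" would be a literal caret, not an anchor.
MATCHER3 = re.compile(
    r"(?<!\S)(%s)\b" % "|".join(PHRASES3),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER4 = re.compile(
    r"(?<!\S)(%s)(?!\S)" % "|".join(PHRASES4),
    flags=re.IGNORECASE | re.UNICODE)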
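The automatic sorting mentioned above could look roughly like the following sketch. The names `classify`, `build_matchers` and `first_match` are invented for illustration, phrases are assumed to be plain text that gets escaped with `re.escape`, and "whitespace or string edge" is expressed with the lookarounds `(?<!\S)` and `(?!\S)`.

```python
import re
from collections import defaultdict

def classify(phrase):
    """Return (starts_with_word_char, ends_with_word_char) for a plain phrase."""
    return (re.match(r"\w", phrase) is not None,
            re.search(r"\w\Z", phrase) is not None)

def build_matchers(phrases):
    """Group phrases by their start/end type and compile one matcher per group."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[classify(phrase)].append(re.escape(phrase))
    matchers = {}
    for (starts_word, ends_word), escaped in groups.items():
        # \b where the phrase edge is a word character,
        # "whitespace or string edge" where it is not.
        prefix = r"\b" if starts_word else r"(?<!\S)"
        suffix = r"\b" if ends_word else r"(?!\S)"
        matchers[(starts_word, ends_word)] = re.compile(
            "%s(%s)%s" % (prefix, "|".join(escaped), suffix), re.IGNORECASE)
    return matchers

def first_match(matchers, text):
    """Return the first phrase found by any matcher, or None."""
    for matcher in matchers.values():
        match = matcher.search(text)
        if match:
            return match.group(1)
    return None

MATCHERS = build_matchers(["sh*t", "sh**", "***hole"])
print(first_match(MATCHERS, "Well sh** you"))
print(first_match(MATCHERS, "an ***hole"))
print(first_match(MATCHERS, "123sh*t"))
```

Note that with whitespace-only edges, a phrase ending in * is not found when punctuation follows it directly (for example `sh**!`); that follows from the design of this approach rather than from the sketch.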
Yunnosch
0

You could embed the boundary requirements in each string, like this:

'\\bsh\\*t\\b', 
'\\bsh\\*\\*',  
'\\bf\\*\\*k\\b',  
'\\bf\\*\\*\\*', 

then r"(%s)" % "|".join(PHRASES)

Or, if the regex engine supports conditionals, it's done like this:

'sh\\*t', 
'sh\\*\\*',  
'f\\*\\*k',  
'f\\*\\*\\*', 

then r"(?(?=\w)\b)(%s)(?(?<=\w)\b)" % "|".join(PHRASES)
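For engines without lookaround conditionals, such as Python's `re`, the same decision can be made per phrase while building the pattern, which amounts to generating the embedded-boundary list from earlier in this answer. This is only a sketch: `with_boundaries` is a hypothetical helper, and the phrases are assumed to be plain text.

```python
import re

def with_boundaries(phrase):
    """Escape a plain phrase and add \\b only on sides that are word characters."""
    left = r"\b" if re.match(r"\w", phrase) else ""
    right = r"\b" if re.search(r"\w\Z", phrase) else ""
    return left + re.escape(phrase) + right

PHRASES = ("sh*t", "sh**", "f**k", "f***")
MATCHER = re.compile(
    "(%s)" % "|".join(with_boundaries(p) for p in PHRASES),
    flags=re.IGNORECASE)

print([with_boundaries(p) for p in PHRASES])
print(MATCHER.search("Well f*** you!"))
print(MATCHER.search("secret code is 123f***"))
```

A side that ends in a non-word character gets no check at all here, so `f***a` would still match `f***`; adding a lookahead on that side is one way to tighten it.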

  • Interesting idea with the embeddings. Python 3 doesn't seem to support that last bit: `re.compile(r"(?(?=\w)\b)(%s)(?(?<=\w)\b)")` -> `bad character in group name '?=\\w' at position 3` – tester Oct 12 '19 at 18:53
  • @tester - For sure they're choices that can be made. A word boundary is nothing more than a pair of lookarounds at the current position. If there is a mix of text words like in the sample given, no boundaries can be generalized without a conditional. Otherwise, it has to be a homegrown set on each sample item. No amount of bounty offered will change that. –  Oct 28 '19 at 15:48
0

I don't fully understand your statement that * is not something that can be found next to a word boundary. However, if I understand what you are looking for correctly from the comments, I think this would work:

\b[\w]\*+[\w]*
  • Word boundary
  • Followed by one word character, like f
  • Followed by one or more *
  • Optionally followed by more word characters, like k

Example:

https://regexr.com/4nqie
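A quick way to see what this pattern does and does not catch is to run it in Python. This is a sketch; `find_all` is a name made up here.

```python
import re

# One word character, one or more '*', then any further word characters.
GENERIC = re.compile(r"\b\w\*+\w*")

def find_all(text):
    """Return every substring matched by the generic pattern."""
    return GENERIC.findall(text)

for sample in ("Well f*** you!", "f**k this!", "sh*t",
               "secret code is 123f***", "2*3"):
    print(repr(sample), "->", find_all(sample))
```

Because exactly one word character is allowed before the first *, `sh*t` is not matched, and because the pattern is not tied to a phrase list, it also picks up unrelated text such as `2*3`.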

slf