
I have a sentence:

'hi how <unk> are you'

I need to remove <unk> from it.

Here is my code:

re.sub(r'\b{}\b'.format('<unk>'), '', 'agent transcript str <unk> with chunks for key phrases')

Why doesn't my RegEx work for <...>?

illuminato

1 Answer


There is no word boundary between a space and < or >. You could instead try

re.sub(r'(\s*)<unk>(\s*)', r'\1\2', your_string)

Or - if you don't want two spaces, you may try

re.sub(r'(\s*)<unk>\s+', r'\1', your_string)


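Remember that \b matches at the boundary between a word character (\w, which is [A-Za-z0-9_] in ASCII mode) and a non-word character (\W), or between a word character and the start or end of the string. In your original pattern, \b sits between a space and < on one side and between > and a space on the other. Both neighbours are non-word characters in each case, so \b cannot match there.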
See a demo on regex101.com.
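
As a minimal sketch of the behaviour described above, using the sentence from the question (the variable names are only illustrative):

```python
import re

sentence = 'hi how <unk> are you'

# \b needs a word character on exactly one side. Around <unk> both
# neighbours are non-word characters (space and <, > and space),
# so the pattern never matches and the string comes back unchanged.
unchanged = re.sub(r'\b<unk>\b', '', sentence)

# Capture the whitespace on both sides and put it back: two spaces remain.
two_spaces = re.sub(r'(\s*)<unk>(\s*)', r'\1\2', sentence)

# Consume the trailing whitespace instead: a single space remains.
one_space = re.sub(r'(\s*)<unk>\s+', r'\1', sentence)

print(repr(unchanged))
print(repr(two_spaces))
print(repr(one_space))
```

Note that the second pattern requires at least one whitespace character after <unk>, so it would not remove a token sitting at the very end of the string.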
Jan
  • The `r'(\s*)<unk>(\s*)'` is a wrong solution. The right one: `r'(?<!\w){}(?!\w)'.format(re.escape('<unk>'))` or `r'(?<!\S){}(?!\S)'.format(re.escape('<unk>'))` – Wiktor Stribiżew Apr 15 '20 at 17:32
  • May I know whether there is any difference between re.sub(r'<unk>', '', 'hi how <unk> are you') and your RegEx? – illuminato Apr 15 '20 at 17:33
  • @illuminato You want to search for whole words, so you need unambiguous word boundaries, or whitespace boundaries, or custom boundaries. You just have to come up with your own definition of a "word". – Wiktor Stribiżew Apr 15 '20 at 17:35
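
The whitespace-boundary pattern from the comment above could be wrapped roughly as follows. This is only a sketch: remove_token is a hypothetical helper name, and collapsing the leftover whitespace with split/join is an added assumption, not something the comment prescribes.

```python
import re

def remove_token(text, token):
    """Remove `token` only where it stands alone between whitespace
    (or at the start or end of the string)."""
    # (?<!\S): no non-whitespace character directly before the token
    # (?!\S):  no non-whitespace character directly after the token
    pattern = r'(?<!\S){}(?!\S)'.format(re.escape(token))
    stripped = re.sub(pattern, '', text)
    # Assumption: collapse the whitespace left behind into single spaces.
    return ' '.join(stripped.split())

print(remove_token('hi how <unk> are you', '<unk>'))
print(remove_token('keep x<unk>y but drop <unk>', '<unk>'))
```

Unlike a plain re.sub(r'<unk>', '', ...), this leaves a token that is glued to other characters (such as x<unk>y) untouched.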