
Anybody know a way of dealing with apostrophes when extracting words from text using a regular expression?

>>> import re
>>> s = re.compile(r"\b[A-Za-z0-9_\-]+\b")
>>> s.findall("I don't know Sally's 'special' friend.")
['I', 'don', 't', 'know', 'Sally', 's', 'special', 'friend']

Desired result:

['I', "don't", 'know', 'Sally', 'special', 'friend']

This discussion covers how to find whole words but doesn't deal with apostrophes.

Bill

1 Answer

s = re.compile(r"(?:^|(?<=\s))[A-Za-z0-9_'\-]+(?=\s|$|\b)")

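Use this instead of \b. Lookarounds will work for you: a match has to start at the beginning of the string or right after whitespace, and the apostrophe is part of the character class, so it no longer splits a word. See the demo: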

https://regex101.com/r/sS2dM8/25
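
For reference, here is a quick check of what that pattern returns on the sample sentence, followed by a minimal sketch of one way to get from there to the desired list. The helper name extract_words is hypothetical (it is not part of the answer), and dropping a trailing 's is an assumption based on the desired result showing 'Sally' rather than "Sally's".

```python
import re

# Pattern from the answer: a token starts at the beginning of the string
# or right after whitespace, and may contain apostrophes and hyphens.
token_re = re.compile(r"(?:^|(?<=\s))[A-Za-z0-9_'\-]+(?=\s|$|\b)")


def extract_words(text):
    """Hypothetical helper: tokenize with token_re, then tidy each token."""
    words = []
    for token in token_re.findall(text):
        word = token.strip("'")      # drop quote marks wrapped around a word
        if word.endswith("'s"):      # assumption: drop a trailing possessive 's
            word = word[:-2]
        if word:                     # skip tokens that were only apostrophes
            words.append(word)
    return words


text = "I don't know Sally's 'special' friend."

raw = token_re.findall(text)
print(raw)    # ['I', "don't", 'know', "Sally's", "'special'", 'friend']

words = extract_words(text)
print(words)  # ['I', "don't", 'know', 'Sally', 'special', 'friend']
```

The regex alone keeps "Sally's" and the quotes around 'special', so the tidy-up step is what closes the gap to the desired result. Note that the 's rule cannot tell a possessive from a contraction, so it would also turn "it's" into "it"; drop that branch if contractions should stay intact.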

vks