
I am trying to write a regex to match full words with or without an apostrophe.

I did this:

\b[a-zA-Z']+\b

However, it partially matches Jönas (the letters on either side of the ö), while the desired behavior is to not match the word Jönas at all because of the ö in it.

Only words made up entirely of characters in a-zA-Z' should match.

Thus the following cases should match in full:

Jonas
Don't
hasn'T

But not for:

Jönas
Dön't
Hélló

Demo here: https://regex101.com/r/2sVN5S/1/ (where Jönas and Hélton should not be matched at all, not even partially)

How can I fix the regex so that it matches exactly this way?

tavalendo

1 Answer

UPDATE. Anubhava and Wiktor Stribiżew pointed out that using \b[a-zA-Z']+\b in Unicode mode is enough (fiddle 1 and fiddle 2).
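As a rough illustration of the Unicode-mode behavior, here is a minimal sketch in Python 3, where `str` patterns are Unicode-aware by default. The helper name `find_words` is made up for this example, and Python stands in for whichever engine the fiddles use.

```python
import re

# Python 3 str patterns are Unicode-aware by default, so \w (and therefore \b)
# treats letters such as ö and é as word characters.
WORD = re.compile(r"\b[a-zA-Z']+\b")

def find_words(text):
    """Return every run of a-zA-Z' that is delimited by word boundaries."""
    return WORD.findall(text)

print(find_words("Jonas Don't hasn'T"))  # ['Jonas', "Don't", "hasn'T"]
print(find_words("Jönas Hélló"))         # []
print(find_words("Dön't"))               # ["'t"]
```

The last call shows a caveat: the apostrophe is not a word character, so a word boundary exists between n and ', and the tail 't of Dön't is still matched.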

As Wiktor said, there is no use case where the answer below is relevant (no engine supports look-around groups without also supporting Unicode mode), so this answer is no longer relevant.


You can use this regex:

\b(?<![\x80-\xFF])[a-zA-Z']+(?![\x80-\xFF])\b

Here, [\x80-\xFF] stands for the range of character codes above the 7-bit ASCII set (where non-English letters lie). Basically, it looks for:

  • a sequence of English letters, with or without apostrophes ...
  • not preceded by a non-English letter (negative lookbehind group (?<!...))
  • not followed by a non-English letter (negative lookahead group (?!...))
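The three conditions above can be tried out with a small sketch. It assumes UTF-8 input and uses a Python bytes pattern to emulate a non-Unicode mode (bytes patterns apply ASCII rules to \b); the names `find_ascii_words` and `find_naive` are hypothetical helpers for this example only.

```python
import re

# Bytes patterns work on raw bytes with ASCII rules for \b, which emulates
# a regex engine running without Unicode mode. Input is assumed to be UTF-8.
NAIVE = re.compile(rb"\b[a-zA-Z']+\b")
ASCII_WORD = re.compile(rb"\b(?<![\x80-\xFF])[a-zA-Z']+(?![\x80-\xFF])\b")

def _find(pattern, text):
    # Encode to UTF-8, match on bytes, decode the (pure ASCII) matches back.
    return [m.decode("ascii") for m in pattern.findall(text.encode("utf-8"))]

def find_naive(text):
    """Original pattern from the question, without look-arounds."""
    return _find(NAIVE, text)

def find_ascii_words(text):
    """Pattern with look-arounds rejecting neighbours in \\x80-\\xFF."""
    return _find(ASCII_WORD, text)

print(find_naive("Jönas"))                     # ['J', 'nas']
print(find_ascii_words("Jönas Hélló"))         # []
print(find_ascii_words("Jonas Don't hasn'T"))  # ['Jonas', "Don't", "hasn'T"]
```

Note that Dön't would still yield 't with this pattern, because the apostrophe sits between the rejected part and the accepted tail.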

Working Regex101.com sample.

Amessihel