I am slowly getting a bit frustrated with regex and German Umlauts. For a speech reader function, I need to get each single word and wrap it in a span tag with an index, and so on. It all works fine so far, but as soon as a word starts with a German Umlaut, it fails.
Here is an example text:
Touch-Screen ist Englisch und wird Tatsch-Skrien ausgesprochen. Übersetzt bedeutet Touchscreen Berührungsbildschirm. Es ist ein berührungsempfindlicher Bildschirm. Zur Bedienung muss man mit dem Finger die Oberfläche des Bildschirms berühren.
And this is the regular expression to find the word "Übersetzt":
/\b(Übersetzt)\b(?![^<span.*>]*<\/span>)/
The span-tag part is needed, since each word will get wrapped in <span id="SOME_ID">word</span>.
So to prevent finding the same word twice, this will exclude those already span-tag wrapped words.
But it fails with words starting with an Umlaut. I added other words starting with an Umlaut, and they all failed. If an Umlaut is somewhere in the middle of the word, it doesn't matter. But as soon as a word starts with an Umlaut, the regex finds no match. Maybe someone has a suggestion as to what I am missing here. Thanks for any advice or help.
I tried figuring out a solution using https://regexr.com/, but it didn't help. Even using \u00dc instead of Ü in this regex didn't help.
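For reference, here is a minimal reproduction sketch, assuming the regex runs in JavaScript (the default engine on regexr.com). The helper name buildWordRegex and the shortened sample text are only for illustration and are not part of my real code. It builds the same pattern as above for a few words and logs whether each one is found.

```javascript
// Shortened sample taken from the example text above.
const text =
  "Übersetzt bedeutet Touchscreen Berührungsbildschirm. " +
  "Es ist ein berührungsempfindlicher Bildschirm.";

// Hypothetical helper: builds the same pattern as in the question for one word.
// Note: the word is inserted unescaped, which is fine for these plain words.
function buildWordRegex(word) {
  return new RegExp("\\b(" + word + ")\\b(?![^<span.*>]*<\\/span>)");
}

// Word starting with an Umlaut, written literally.
const startsWithUmlaut = buildWordRegex("Übersetzt").test(text);

// Same word, with the Umlaut written as the \u00dc escape in a regex literal.
const escapedUmlaut = /\b(\u00dcbersetzt)\b(?![^<span.*>]*<\/span>)/.test(text);

// Word with an Umlaut in the middle.
const umlautInMiddle = buildWordRegex("Berührungsbildschirm").test(text);

// Word without any Umlaut.
const noUmlaut = buildWordRegex("Touchscreen").test(text);

console.log("Übersetzt (literal Ü):", startsWithUmlaut); // false
console.log("Übersetzt (\\u00dc):", escapedUmlaut);      // false
console.log("Berührungsbildschirm:", umlautInMiddle);    // true
console.log("Touchscreen:", noUmlaut);                   // true
```

So the only variants that fail are the ones where the Umlaut sits directly after the leading \b, regardless of whether Ü is typed literally or as an escape.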