
I am slowly getting frustrated with regex and German umlauts. For a speech reader function, I need to find every single word and wrap it in a span tag with an index. That works fine so far, but as soon as a word starts with a German umlaut, the match fails.

Here is an example text:

Touch-Screen ist Englisch und wird Tatsch-Skrien ausgesprochen. Übersetzt bedeutet Touchscreen Berührungsbildschirm. Es ist ein berührungsempfindlicher Bildschirm. Zur Bedienung muss man mit dem Finger die Oberfläche des Bildschirms berühren.

And this is the regular expression to find the word "Übersetzt":

/\b(Übersetzt)\b(?![^<span.*>]*<\/span>)/

The span-tag part is needed because each word gets wrapped in <span id="SOME_ID">word</span>. To avoid finding the same word twice, the lookahead excludes words that are already wrapped in a span tag.

But it fails with words starting with an umlaut. I added other words starting with an umlaut, and they all failed. If the umlaut is somewhere in the middle of the word, it doesn't matter, but as soon as a word starts with one, the regex doesn't match. Maybe someone can point out what I am missing here. Thanks for any advice or help.

I tried to figure out a solution using https://regexr.com/, but it didn't help. Using \u00dc instead of Ü in this regex didn't help either.
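The behaviour can be reproduced in a few lines. This is a minimal sketch, assuming the code runs in JavaScript (the question does not name the language) and using a shortened version of the sample text with the span lookahead left out. In JavaScript, `\b` is defined in terms of `\w`, which only covers `[A-Za-z0-9_]`, so there is no word boundary between a space and `Ü`:

```javascript
// Shortened sample from the text above.
const text = "ausgesprochen. Übersetzt bedeutet Touchscreen Berührungsbildschirm.";

// \b sits between a \w character ([A-Za-z0-9_]) and a non-\w character.
// "Ü" is not in \w, so the space before it and "Ü" itself are both
// non-word characters and no boundary exists at the start of "Übersetzt".
const startsWithUmlaut = /\b(Übersetzt)\b/.test(text);

// An umlaut in the middle is harmless: the word starts with "B" and
// ends with "m", both ASCII letters, so both boundaries exist.
const umlautInside = /\b(Berührungsbildschirm)\b/.test(text);

console.log(startsWithUmlaut, umlautInside); // false true
```

Writing `\u00dc` instead of `Ü` makes no difference here, because the escape denotes the same character and the problem lies in `\b`, not in how the umlaut is spelled.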

MllrArt
  • `/(?<=\P{L})(Übersetzt)(?=\P{L})/u`. I’m not comfortable seeing HTML tags in a regular expression. There are much better alternatives, using DOM APIs. – Sebastian Simon Dec 24 '22 at 14:55
  • @SebastianSimon Thanks so much! That saved me a lot of trial and error with regexr. I know it is not the best way, but it was the quickest way to implement text highlighting with speech synthesis, especially since I don't know yet what the final highlight styling will look like. So each word has an ID and a class for highlighting. – MllrArt Dec 24 '22 at 15:03
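The lookaround idea from the first comment can be sketched as follows. This is only an illustration under assumptions: `wrapWord` and the ids `w0`/`w1` are hypothetical names, and the span check `(?![^<]*</span>)` is a simplified stand-in for the lookahead in the question. It uses the negative forms `(?<!\p{L})` and `(?!\p{L})` instead of `(?<=\P{L})` and `(?=\P{L})`, because the positive forms require an actual non-letter character on each side and so cannot match a word at the very start or end of the string.

```javascript
// Hypothetical helper, not from the question: wraps the first occurrence of
// `word` that is not already inside a span in <span id="...">word</span>.
function wrapWord(text, word, id) {
  // Escape regex metacharacters in the word (for example "." or "+").
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // (?<!\p{L}) / (?!\p{L}): no Unicode letter directly before or after,
  // which acts as a Unicode-aware replacement for \b (needs the "u" flag).
  // (?![^<]*</span>): skip a match that is directly followed by a closing span tag.
  const pattern = new RegExp(
    `(?<!\\p{L})(${escaped})(?!\\p{L})(?![^<]*</span>)`,
    "u"
  );
  return text.replace(pattern, `<span id="${id}">$1</span>`);
}

// Usage: the second call skips the already wrapped occurrence.
let html = "Übersetzt bedeutet: Übersetzt.";
html = wrapWord(html, "Übersetzt", "w0");
html = wrapWord(html, "Übersetzt", "w1");
console.log(html);
// <span id="w0">Übersetzt</span> bedeutet: <span id="w1">Übersetzt</span>.
```

As the comment points out, matching HTML with a regex stays fragile (a word such as "span" or "id" could match inside the markup itself), so walking the text nodes with DOM APIs is likely the more robust route once the quick version is no longer enough.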
