Split at \P{IsAlphabetic} (with an uppercase P, which negates the property, so it matches any character that is not alphabetic):
String s = "überbrücken röntgenstraheln ängstlich";
String[] textArr = s.split("\\P{IsAlphabetic}");
System.out.println(Arrays.toString(textArr));
Output:
[überbrücken, röntgenstraheln, ängstlich]
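The basic regex classes like \W only recognize ASCII characters by default, so only A through Z and a through z count as letters (\w additionally accepts the digits 0 through 9 and the underscore). That explains the result you observed: every vowel with an umlaut was treated as a non-word character and therefore as a separator. There is support for Unicode characters too, though, through some of the \p{…} and \P{…} constructs. See Andreas’s knowledgeable answer and the documentation for more.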
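To make the difference concrete, here is a minimal sketch that splits the same string three ways. The class and method names are my own, made up for illustration. The third variant uses the embedded flag (?U), which turns on Pattern.UNICODE_CHARACTER_CLASS so that \W follows Unicode rules; I mention it only as an alternative, not as something your code needs. The string is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.util.Arrays;

class UnicodeSplitDemo {

    // Default \W: ASCII only, so every vowel with an umlaut acts as a separator.
    static String[] splitAscii(String text) {
        return text.split("\\W");
    }

    // \P{IsAlphabetic}: any character without the Unicode Alphabetic property.
    static String[] splitAlphabetic(String text) {
        return text.split("\\P{IsAlphabetic}");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS, making \W Unicode-aware.
    // Note that digits and the underscore still count as word characters here.
    static String[] splitUnicodeWord(String text) {
        return text.split("(?U)\\W");
    }

    public static void main(String[] args) {
        // Same text as in the example above, spelled with Unicode escapes.
        String s = "\u00fcberbr\u00fccken r\u00f6ntgenstraheln \u00e4ngstlich";

        System.out.println(Arrays.toString(splitAscii(s)));       // words broken at each umlaut
        System.out.println(Arrays.toString(splitAlphabetic(s)));  // three whole words
        System.out.println(Arrays.toString(splitUnicodeWord(s))); // three whole words
    }
}
```

With the ASCII-only \W the words fall apart at every umlaut, and you also get empty strings where two separators are adjacent. The two Unicode-aware variants keep the three words whole.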
Disclaimer: I wanted to keep my code simple and guessed that this might be what you were really after. I have made no attempt to mimic exactly what your own code does with only an adjustment for vowels with umlauts. I trust you to adapt my code from here if it’s not exactly what you wanted.