When I split a String of words, the umlauts are deleted.

public static void main(String[] args) {
    String s = "überbrücken";
    String[] ss = s.split("\\W");
    System.out.println(ss[0] + ss[1] + ss[2]);
}

This prints "berbrcken" instead of "überbrücken".

3 Answers

Split at \P{IsAlphabetic} (note the uppercase P, which negates the property):

    String s = "überbrücken röntgenstraheln ängstlich";
    String[] textArr = s.split("\\P{IsAlphabetic}");
    System.out.println(Arrays.toString(textArr));

Output:

[überbrücken, röntgenstraheln, ängstlich]

By default, the predefined regex classes such as \w and \W only recognize ASCII characters: only A through Z, a through z, the digits 0 through 9 and the underscore count as word characters. Since ü is none of these, \W matches it and split() treats it as a delimiter, which explains the result you observed. Unicode characters are supported too, though, through the \p{…} and \P{…} constructs. See Andreas’s knowledgeable answer and the documentation for more.

Disclaimer: I wanted to keep my code simple and guessed that this might be what you were really after. I have made no attempt to reproduce exactly what your own code does with umlaut support added; in particular, \P{IsAlphabetic} also splits at digits and underscores, which \W does not. I trust you to adjust my code from here if it’s not exactly what you wanted.
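
To make that difference concrete, here is a minimal sketch comparing the two behaviours. The class name, the method names and the sample input are made up for illustration, and the second pattern is just one possible Unicode-aware stand-in for \W, not something taken from the question. Both patterns use a trailing + so that a run of several delimiters does not produce empty strings.

```java
import java.util.Arrays;

public class SplitComparison {

    // Splits at every run of characters that are not Unicode letters,
    // so digits and underscores act as delimiters too.
    static String[] splitOnNonAlphabetic(String s) {
        return s.split("\\P{IsAlphabetic}+");
    }

    // Splits at every run of characters that are neither letters, digits
    // nor underscores, which is closer to what \W means for ASCII text.
    static String[] splitOnNonWord(String s) {
        return s.split("[^\\p{IsAlphabetic}\\p{IsDigit}_]+");
    }

    public static void main(String[] args) {
        String s = "größe_2 überbrücken"; // hypothetical sample input

        System.out.println(Arrays.toString(splitOnNonAlphabetic(s)));
        // [größe, überbrücken]

        System.out.println(Arrays.toString(splitOnNonWord(s)));
        // [größe_2, überbrücken]
    }
}
```

Which of the two is appropriate depends on whether tokens such as größe_2 should stay in one piece.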

Ole V.V.
  • `\w` is **not** the same as `Alphabetic`!! --- *Hint:* Digits and underscore are not "alphabetic" characters. – Andreas Feb 18 '21 at 20:53
  • @Andreas Obviously completely correct. I think that it’s very clear from the name `IsAlphabetic` that it doesn’t include digits and underscore. – Ole V.V. Feb 18 '21 at 22:00
  • @OleV.V. But it's so easy for people who know this part of regex (like you) to make it right, i.e. make the regex fit the question's regex but with accented character support, that not doing so, or not even mentioning that it's different, got me a bit hot. Sorry about that, but why not just use `[^\\p{IsAlphabetic}_\\p{IsDigit}]` to get the same definition as `\\W`? – Andreas Feb 18 '21 at 23:13

The documentation, i.e. the javadoc of Pattern, explicitly states:

\W - A non-word character: [^\w]

\w - A word character: [a-zA-Z_0-9]

This means that accented characters such as ü are not word characters, so \W matches them and split() treats them as delimiters.

There are two ways to fix this:

  1. Specify flag UNICODE_CHARACTER_CLASS.

    That can be done by adding that flag as the second argument to Pattern.compile(), or by specifying the flag in the regex itself:

    split("(?U)\\W")
    
  2. Use Unicode Categories:

    split("[^\\p{L}_\\p{N}]")
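
As a rough sketch of the first option with the flag passed to Pattern.compile() rather than embedded in the regex: the class name, the method name and the sample text below are invented for illustration, and the + after \W is my addition so that a comma followed by a space counts as a single delimiter.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UnicodeWordSplit {

    // Compiled once and reused. UNICODE_CHARACTER_CLASS makes \w and \W
    // follow the Unicode definitions instead of the ASCII-only ones.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] splitWords(String s) {
        return NON_WORD.split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("überbrücken, ängstlich")));
        // [überbrücken, ängstlich]
    }
}
```

Compiling the pattern up front is mainly worthwhile when the same split is applied to many strings; for a one-off call, the inline (?U) form is shorter.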
    
Andreas
As an alternative solution, you can append a delimiter character to each sequence of non-word characters and then split the string around those delimiters. This keeps the non-word characters (here, the umlauts) in the result:

String str = "überbrücken";

String[] arr = str
        // append a delimiter to each non-empty
        // sequence of non-word characters
        .replaceAll("\\W+", "$0\u2980")
        // split the string into an array
        // around these delimiters
        .split("\u2980");

// output
System.out.println(Arrays.toString(arr));
// [ü, berbrü, cken]
