First, all Java Strings are UTF-16 encoded, not UTF-8. This is important for tasks like reversing strings, because the number of bytes a character takes up depends on the encoding. Both encodings are variable-width: in UTF-8 a character takes one to four bytes, whereas in UTF-16 it takes either two or four bytes, that is, one or two 16-bit code units. A char is a single 16-bit code unit, even if it's just representing ASCII. UTF-8 can encode ASCII in 8 bits, but may take more to represent other characters.
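To make the size differences concrete, here is a small sketch that encodes three sample strings both ways and counts the bytes. The class name, the helper methods and the sample characters are my own choices for illustration; the emoji is written with escapes so the file encoding doesn't matter.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, so no byte-order mark is added).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "a";            // U+0061
        String latin = "\u017D";       // U+017D, fits in one char
        String emoji = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair
        for (String s : new String[] {ascii, latin, emoji}) {
            System.out.println(utf8Bytes(s) + " UTF-8 bytes, "
                    + utf16Bytes(s) + " UTF-16 bytes, "
                    + s.length() + " char(s)");
        }
    }
}
```

The ASCII letter is 1 byte in UTF-8 and 2 in UTF-16, the accented letter is 2 bytes in both, and the emoji is 4 bytes in both while reporting a length() of 2.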
Because a char is 16 bits, most characters (including Ž®aͻ from your example) fit nicely into individual chars, and there are no issues. However, some characters (notably emoji) cannot be represented by a single char, and now we're dealing with surrogate pairs. You have to be very careful with string manipulations when dealing with text that might contain surrogate pairs, because most Java APIs (notably almost every method on String) work on individual chars and so don't handle them properly.
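As a rough illustration of what "not handled properly" means for reversal, the sketch below swaps chars one by one and then checks whether the result contains an unpaired surrogate. Both helpers (reverseByChar, hasBrokenPair) and the sample input are hypothetical names I made up for this example.

```java
class NaiveReverse {
    // Reverses char by char. Deliberately naive: it ignores surrogate pairs.
    static String reverseByChar(String text) {
        char[] chars = text.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    // True if the text contains a high surrogate not followed by a low one,
    // or a low surrogate not preceded by a high one.
    static boolean hasBrokenPair(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++; // well-formed pair, skip its second half
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String input = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        String reversed = reverseByChar(input);
        System.out.println("Input well-formed:    " + !hasBrokenPair(input));
        System.out.println("Reversed well-formed: " + !hasBrokenPair(reversed));
    }
}
```

The reversed string has the low surrogate in front of the high one, so it no longer encodes the emoji and typically renders as replacement characters.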
For a better example, consider the string "". Six characters, right? Not according to Java!
String s ="";
System.out.println("String: " + s);
System.out.println("Length: " + s.length());
System.out.println("Chars: " + Arrays.toString(s.toCharArray()));
System.out.println("Split: " + Arrays.asList(s.split("")));
This prints:
String:
Length: 12
Chars: [?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]
Split: [?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]
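If you want the count a reader would expect rather than the number of chars, String.codePointCount() is one option. The sketch below is only an illustration: the class name, the describe helper and the two-emoji sample string are my own, not the string from the example above.

```java
import java.util.List;
import java.util.stream.Collectors;

class CountCodePoints {
    // Number of Unicode code points, counting each surrogate pair once.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // Lists the code points in U+XXXX notation, which prints safely on any console.
    static List<String> describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00\uD83D\uDE01"; // U+1F600 followed by U+1F601
        System.out.println("length():      " + s.length());
        System.out.println("code points:   " + codePointLength(s));
        System.out.println("as code points: " + describe(s));
    }
}
```

Here length() reports 4 while the code point count is 2, mirroring the 12-versus-6 mismatch above.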
Now, some APIs do properly handle surrogate pairs, such as StringBuilder.reverse():
If there are any surrogate pairs included in the sequence, these are treated as single characters for the reverse operation. Thus, the order of the high-low surrogates is never reversed.
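A quick way to convince yourself of that documented behavior is a check like the following. The wrapper class and the sample input are assumptions of mine; only StringBuilder.reverse() itself comes from the JDK.

```java
class BuilderReverse {
    // Reverses the text with StringBuilder, which keeps surrogate pairs in order.
    static String reverse(String text) {
        return new StringBuilder(text).reverse().toString();
    }

    public static void main(String[] args) {
        String input = "a\uD83D\uDE00b";    // 'a', U+1F600, 'b'
        String expected = "b\uD83D\uDE00a"; // same pair, high surrogate still first
        System.out.println("Pair preserved: " + reverse(input).equals(expected));
    }
}
```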
Assuming for the sake of the interview that you can't use this method (or, understandably, you can't recall on the spot whether it's safe or not), you can iterate over the code points of a String with String.codePoints(). This allows you to safely reverse the contents:
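// Turn each code point into its own String, so a surrogate pair stays together.
List<String> chars = s.codePoints()
    .mapToObj(i -> String.valueOf(Character.toChars(i)))
    .collect(Collectors.toList());
// Reverse the list of whole characters, then join them back into one String.
Collections.reverse(chars);
System.out.println(String.join("", chars));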
Prints: