The correct way to do this is using a regex to match the entire unicode definition and use group-replacement.
The regex to match the unicode-string:
A unicode-character looks like \uABCD
, so \u
, followed by a 4-character hexnumber string. Matching these can be done using
\\u[A-Fa-f\d]{4}
But there's a problem with this:
In a String
like "just some \\uabcd arbitrary text" the \u
would still get matched. So we need to make sure the \u
is preceeded by an even number of \
s:
(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
Now as an output, we want a backslash followed by the hexnum-part. This can be done by group-replacement, so let's get start by grouping characters:
(?<!\\)(\\\\)*(\\u)([A-Fa-f\d]{4})
As a replacement we want all backlashes from the group that matches two backslashes, followed by a backslash and the hexnum-part of the unicode-literal:
$1\\$3
Now for the actual code:
String pattern = "(?<!\\\\)(\\\\\\\\)*(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";
Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);
That's a lot of backslashes! Well, there's an issue with java, regex and backslash: backslashes need to be escaped in java and regex. So "\\\\" as a pattern-string in java matches one \ as regex-matched character.
EDIT:
On actual strings, the characters need to be filtered out and be replaced by their integer-representation:
StringBuilder sb = new StringBuilder();
for(char c : in.toCharArray())
if(c > 127)
sb.append("\\").append(String.format("%04x", (int) c));
else
sb.append(c);
This assumes by "unicode-character" you mean non-ASCII-characters. This code will print any ASCII-character as is and output all other characters as backslash followed by their unicode-code. The definition "unicode-character" is rather vague though, as char
in java always represents unicode-characters. This approach preserves any control-chars like "\n", "\r", etc., which is why I chose it over other definitions.